1 Introduction

This dataset is based on the “Bank Marketing” UCI dataset (see the description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The data is enriched with five additional social and economic features/attributes (nationwide indicators from a country of roughly 10M inhabitants), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in [Moro et al., 2014] (it does not include all attributes due to privacy concerns).

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The classification goal is to predict whether the client will subscribe a term deposit (variable y).

1.1 Dataset

1.1.1 Available features

1.1.1.1 Bank client data:

  • Age (numeric)
  • Job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
  • Marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’ ; note: ‘divorced’ means divorced or widowed)
  • Education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
  • Default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

1.1.1.2 Related with the last contact of the current campaign:

  • Contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
  • Month: last contact month of year (categorical: ‘mar’, ‘apr’, …, ‘nov’, ‘dec’)
  • Day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
  • Duration: last contact duration, in seconds (numeric)

1.1.1.3 Other attributes:

  • Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • Previous: number of contacts performed before this campaign and for this client (numeric)
  • Poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

1.1.1.4 Social and economic context attributes:

  • Emp.var.rate: employment variation rate - quarterly indicator (numeric)
  • Cons.price.idx: consumer price index - monthly indicator (numeric)
  • Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  • Euribor3m: euribor 3 month rate - daily indicator (numeric)
  • Nr.employed: number of employees - quarterly indicator (numeric)

1.1.1.5 Output variable (desired target):

  • y: has the client subscribed a term deposit? (binary: ‘yes’, ‘no’)

1.1.2 First insight

1.1.2.2 Dataset import

## [1] 41188    21

The dataset has 41,188 rows and 21 columns.
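The import chunk itself is not echoed in the report. A minimal sketch of what it likely looks like, assuming the standard UCI distribution file (“bank-additional-full.csv”, semicolon-separated); a two-row inline sample stands in for the real file so the snippet is self-contained:

```r
# Hypothetical import step: the real report reads the UCI CSV with
# sep = ";". A tiny inline sample is used here instead of the file.
sample_csv = c('"age";"job";"y"',
               '56;"housemaid";"no"',
               '57;"services";"yes"')
tmp = tempfile(fileext = ".csv")
writeLines(sample_csv, tmp)

bank_sample = read.csv(tmp, sep = ";", stringsAsFactors = TRUE)
dim(bank_sample)   # 2 x 3 here; 41188 x 21 with the full file
```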

##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"

The first 20 variables are our potential explanatory variables and the last one (“y”) is the dependent variable.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##           |        no |       yes | 
##           |-----------|-----------|
##           |     36548 |      4640 | 
##           |     0.887 |     0.113 | 
##           |-----------|-----------|
## 
## 
## 
## 

This is an imbalanced two-level categorical variable: 88.7% of the values are “no” (or “0”) and only 11.3% are “yes” (or “1”). It is more natural to work with a 0/1 dependent variable:
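The recode itself is not shown in the report; one way to do it in base R, sketched on a small synthetic factor (the same one-liner applies to the `y` column of the dataset):

```r
# Recode the target from 'yes'/'no' to a 1/0 factor.
y = factor(c("no", "no", "yes", "no"))   # synthetic stand-in for bank_data$y
y = factor(ifelse(y == "yes", 1, 0))
table(y)   # three "0", one "1"
```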

##   age       job marital   education default housing loan   contact month
## 1  56 housemaid married    basic.4y      no      no   no telephone   may
## 2  57  services married high.school unknown      no   no telephone   may
## 3  37  services married high.school      no     yes   no telephone   may
## 4  40    admin. married    basic.6y      no      no   no telephone   may
## 5  56  services married high.school      no      no  yes telephone   may
## 6  45  services married    basic.9y unknown      no   no telephone   may
##   day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1         mon      261        1   999        0 nonexistent          1.1
## 2         mon      149        1   999        0 nonexistent          1.1
## 3         mon      226        1   999        0 nonexistent          1.1
## 4         mon      151        1   999        0 nonexistent          1.1
## 5         mon      307        1   999        0 nonexistent          1.1
## 6         mon      198        1   999        0 nonexistent          1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1         93.994         -36.4     4.857        5191 0
## 2         93.994         -36.4     4.857        5191 0
## 3         93.994         -36.4     4.857        5191 0
## 4         93.994         -36.4     4.857        5191 0
## 5         93.994         -36.4     4.857        5191 0
## 6         93.994         -36.4     4.857        5191 0
## [1] 0

There are no missing values in the dataset. However, according to the data documentation, the value “unknown” stands for NA.

## [1] 12718

There are 12,718 “unknown” values in the dataset. Let’s find out which variables suffer the most from them.

##          variable nr_unknown
## 1         default       8597
## 2       education       1731
## 3         housing        990
## 4            loan        990
## 5             job        330
## 6         marital         80
## 7             age          0
## 8         contact          0
## 9           month          0
## 10    day_of_week          0
## 11       duration          0
## 12       campaign          0
## 13          pdays          0
## 14       previous          0
## 15       poutcome          0
## 16   emp.var.rate          0
## 17 cons.price.idx          0
## 18  cons.conf.idx          0
## 19      euribor3m          0
## 20    nr.employed          0
## 21              y          0

Six features have at least one “unknown” value. Before deciding how to handle these missing values, we’ll study each variable and decide after visualisation. We can’t afford to delete 8,597 rows from the dataset: that’s more than 20% of our observations.
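A per-variable count like the table above can be produced with a single `sapply` call; a sketch on a synthetic three-column data frame (on the real data, the same call over all 21 columns reproduces the table):

```r
# Count "unknown" values per column. Numeric columns compare FALSE
# everywhere, so they simply count as 0.
toy = data.frame(job     = c("admin.", "unknown", "services"),
                 marital = c("married", "single", "unknown"),
                 age     = c(30, 40, 50))
nr_unknown = sort(sapply(toy, function(x) sum(x == "unknown")),
                  decreasing = TRUE)
nr_unknown   # job: 1, marital: 1, age: 0
```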

1.2 Exploratory Analysis

1.2.1 Univariate analysis

Each feature will be checked one at a time. We’ll eventually drop or transform some variables in order to clean up our dataset a little bit.

Here are a few helper functions that will be useful later.

theme_set(theme_bw())

# setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
                                                hjust = 1,
                                                vjust = 0.5),
                     axis.text.y = element_blank(),
                     axis.ticks.y = element_blank())

# setting default parameters for crosstables
fun_crosstable = function(df, var1, var2){
  CrossTable(df[, var1], df[, var2],
             prop.r = T,
             prop.c = F,
             prop.t = F,
             prop.chisq = F,
             dnn = c(var1, var2))
}

# plot weighted lm/loess regressions with frequencies
fun_gg_freq = function(var){
  weight = table(bank_data[, var]) %>% 
    as.data.frame %>% 
    mutate(x = as.numeric(as.character(Var1))) %>% 
    select(-Var1) %>% 
    rename(weight = Freq)
  
  sink(tempfile())
  freq = fun_crosstable(bank_data, var, "y")$prop.r %>% 
    as.data.frame %>% 
    mutate(x = as.numeric(as.character(x)))
  sink()
  
  both = freq %>% 
    left_join(weight, by = "x") %>% 
    filter(weight > 50 & y == 1)
  
  gg = both %>% 
    ggplot() +
    aes(x = x,
        y = Freq,
        weight = weight) +
    geom_point(aes(size = weight)) +
    geom_smooth(aes(colour = "blue"), method = "loess") +
    geom_smooth(aes(colour = "red"), method = "lm", se = F) +
    coord_cartesian(ylim = c(-0.1, 1)) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "pt")) +
    xlab(var) +
    ylab("") +
    scale_x_continuous(position = "top") +
    scale_colour_manual(values = c("blue", "red"),
                        labels = c("loess", "lm")) +
    labs(colour = "Regression")
  
  return(gg)
}

# re-ordering levels from factor variable
fun_reorder_levels = function(df, variable, first){
  remaining = unique(df[, variable])[which(unique(df[, variable]) != first)]
  x = factor(df[, variable], levels = c(first, remaining))
  return(x)
}

# plotting importance from predictive models into two panels
fun_imp_ggplot_split = function(model){
  if (class(model)[1] == "ranger"){
    imp_df = model$variable.importance %>% 
      data.frame("Overall" = .) %>% 
      rownames_to_column() %>% 
      rename(variable = rowname) %>% 
      arrange(-Overall)
  } else {
    imp_df = varImp(model) %>%
      rownames_to_column() %>% 
      rename(variable = rowname) %>% 
      arrange(-Overall)
  }
  
  gg1 = imp_df %>% 
    slice(1:floor(nrow(.)/2)) %>% 
    ggplot() +
    aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
    geom_bar() +
    coord_flip() +
    xlab("Variables") +
    ylab("Importance") +
    theme(legend.position = "none")
  
  imp_range = ggplot_build(gg1)[["layout"]][["panel_params"]][[1]][["x.range"]]
  imp_gradient = scale_fill_gradient(limits = c(-imp_range[2], -imp_range[1]),
                                     low = "#132B43", 
                                     high = "#56B1F7")
  
  gg2 = imp_df %>% 
    slice((floor(nrow(.)/2) + 1):nrow(.)) %>%  # parentheses needed: `:` binds tighter than `+`
    ggplot() +
    aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
    geom_bar() +
    coord_flip() +
    xlab("") +
    ylab("Importance") +
    theme(legend.position = "none") +
    ylim(imp_range) +
    imp_gradient
  
  gg_both = plot_grid(gg1 + imp_gradient,
                      gg2)
  
  return(gg_both)
}

# plotting two performance measures
fun_gg_cutoff = function(score, obs, measure1, measure2) {
  predictions = prediction(score, obs)
  performance1 = performance(predictions, measure1)
  performance2 = performance(predictions, measure2)
  
  df1 = data.frame(x = performance1@x.values[[1]],
                   y = performance1@y.values[[1]],
                   measure = measure1,
                   stringsAsFactors = F)
  df2 = data.frame(x = performance2@x.values[[1]],
                   y = performance2@y.values[[1]],
                   measure = measure2,
                   stringsAsFactors = F)
  df = df1 %>% 
    bind_rows(df2)
  
  gg = df %>% 
    ggplot() +
    aes(x = x,
        y = y,
        colour = measure) +
    geom_line() +
    xlim(c(0, 1)) +
    ylab("") +
    xlab("Threshold")
  
  return(gg)
}

# creating classes according to score and cut
fun_cut_predict = function(score, cut) {
  classes = score
  classes[classes > cut] = 1
  classes[classes <= cut] = 0
  classes = as.factor(classes)
  
  return(classes)  
}

1.2.1.1 Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   32.00   38.00   40.02   47.00   98.00

Ages range from 17 to 98; nothing looks strange there. The other summary statistics are fine too, with an average of about 40 years.

Above the 60-year threshold, the relative frequency is higher when y = 1. In other words, elderly persons are more likely to subscribe to a term deposit.

## # A tibble: 4 x 4
## # Groups:   elder60 [2]
##   elder60 y     abs_freq relative_freq
##   <chr>   <fct>    <int>         <dbl>
## 1 0       0        36052         98.6 
## 2 0       1         4226         91.1 
## 3 1       0          496          1.36
## 4 1       1          414          8.92

Elderly persons make up 8.92% of the subscribers, while they represent only 1.36% of the non-subscribers.

We can also slice at 30 years to make three easily interpretable classes: [0, 30[, [30, 60[ and [60, +Inf[. The minimum and maximum values are 17 and 98, but we can expect new observations outside this range. We replace the continuous variable “age” with this categorical variable.
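The binning described above can be done with base R’s `cut()`; `right = FALSE` gives exactly the half-open intervals [0, 30[, [30, 60[ and [60, +Inf[, and `-Inf`/`Inf` breaks cover future observations outside the 17–98 range (a sketch on synthetic ages; the “low”/“mid”/“high” labels match the crosstable below):

```r
# Three age classes: [<30[ = low, [30, 60[ = mid, [60+ = high.
age = c(17, 29, 30, 59, 60, 98)          # synthetic ages
age_class = cut(age,
                breaks = c(-Inf, 30, 60, Inf),
                labels = c("low", "mid", "high"),
                right  = FALSE)           # left-closed intervals
table(age_class)   # low: 2, mid: 2, high: 2
```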

Let’s cross it with our dependent variable.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##              | y 
##          age |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##         high |       496 |       414 |       910 | 
##              |     0.545 |     0.455 |     0.022 | 
## -------------|-----------|-----------|-----------|
##          low |      6259 |      1124 |      7383 | 
##              |     0.848 |     0.152 |     0.179 | 
## -------------|-----------|-----------|-----------|
##          mid |     29793 |      3102 |     32895 | 
##              |     0.906 |     0.094 |     0.799 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36548 |      4640 |     41188 | 
## -------------|-----------|-----------|-----------|
## 
## 

45.5% of people over 60 subscribed a term deposit, which is a lot compared with younger individuals: 15.2% for young adults (under 30) and only 9.4% for the remaining observations (aged between 30 and 60).

1.2.1.2 Job

## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##         10422          9254          1456          1060          2924 
##       retired self-employed      services       student    technician 
##          1720          1421          3969           875          6743 
##    unemployed       unknown 
##          1014           330

The dataset mainly contains admin. (white-collar?) and blue-collar jobs. We can also notice 330 “unknown” values.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##               | y 
##           job |         0 |         1 | Row Total | 
## --------------|-----------|-----------|-----------|
##        admin. |      9070 |      1352 |     10422 | 
##               |     0.870 |     0.130 |     0.253 | 
## --------------|-----------|-----------|-----------|
##   blue-collar |      8616 |       638 |      9254 | 
##               |     0.931 |     0.069 |     0.225 | 
## --------------|-----------|-----------|-----------|
##  entrepreneur |      1332 |       124 |      1456 | 
##               |     0.915 |     0.085 |     0.035 | 
## --------------|-----------|-----------|-----------|
##     housemaid |       954 |       106 |      1060 | 
##               |     0.900 |     0.100 |     0.026 | 
## --------------|-----------|-----------|-----------|
##    management |      2596 |       328 |      2924 | 
##               |     0.888 |     0.112 |     0.071 | 
## --------------|-----------|-----------|-----------|
##       retired |      1286 |       434 |      1720 | 
##               |     0.748 |     0.252 |     0.042 | 
## --------------|-----------|-----------|-----------|
## self-employed |      1272 |       149 |      1421 | 
##               |     0.895 |     0.105 |     0.035 | 
## --------------|-----------|-----------|-----------|
##      services |      3646 |       323 |      3969 | 
##               |     0.919 |     0.081 |     0.096 | 
## --------------|-----------|-----------|-----------|
##       student |       600 |       275 |       875 | 
##               |     0.686 |     0.314 |     0.021 | 
## --------------|-----------|-----------|-----------|
##    technician |      6013 |       730 |      6743 | 
##               |     0.892 |     0.108 |     0.164 | 
## --------------|-----------|-----------|-----------|
##    unemployed |       870 |       144 |      1014 | 
##               |     0.858 |     0.142 |     0.025 | 
## --------------|-----------|-----------|-----------|
##       unknown |       293 |        37 |       330 | 
##               |     0.888 |     0.112 |     0.008 | 
## --------------|-----------|-----------|-----------|
##  Column Total |     36548 |      4640 |     41188 | 
## --------------|-----------|-----------|-----------|
## 
## 

The “unknown” level doesn’t carry any useful information and should be discarded: we’ll remove the rows containing this value in the “job” column.
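Dropping those rows is a one-liner; `droplevels()` then removes the now-empty factor level so it doesn’t linger in later crosstables (a sketch on a synthetic data frame):

```r
# Remove rows where job == "unknown" and drop the empty level.
toy = data.frame(job = factor(c("admin.", "unknown", "retired")),
                 y   = c(0, 1, 1))                 # synthetic stand-in
toy = droplevels(toy[toy$job != "unknown", ])
nrow(toy)          # 2 rows remain
levels(toy$job)    # "admin." "retired"
```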

Surprisingly, students (31.4%), retired people (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription. Other levels range between 6.9% (blue-collar) and 13.0% (admin.).

1.2.1.3 Marital situation

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40858 
## 
##  
##              | y 
##      marital |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     divorced |      4126 |       473 |      4599 | 
##              |     0.897 |     0.103 |     0.113 | 
## -------------|-----------|-----------|-----------|
##      married |     22178 |      2516 |     24694 | 
##              |     0.898 |     0.102 |     0.604 | 
## -------------|-----------|-----------|-----------|
##       single |      9889 |      1605 |     11494 | 
##              |     0.860 |     0.140 |     0.281 | 
## -------------|-----------|-----------|-----------|
##      unknown |        62 |         9 |        71 | 
##              |     0.873 |     0.127 |     0.002 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36255 |      4603 |     40858 | 
## -------------|-----------|-----------|-----------|
## 
## 

For the same reasons as before, we’ll remove the rows with an “unknown” value for this variable.

Single people subscribe slightly more often (14.0%) to term deposits than divorced (10.3%) or married (10.2%) people.

1.2.1.4 Education

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40787 
## 
##  
##                     | y 
##           education |         0 |         1 | Row Total | 
## --------------------|-----------|-----------|-----------|
##            basic.4y |      3695 |       423 |      4118 | 
##                     |     0.897 |     0.103 |     0.101 | 
## --------------------|-----------|-----------|-----------|
##            basic.6y |      2077 |       187 |      2264 | 
##                     |     0.917 |     0.083 |     0.056 | 
## --------------------|-----------|-----------|-----------|
##            basic.9y |      5536 |       470 |      6006 | 
##                     |     0.922 |     0.078 |     0.147 | 
## --------------------|-----------|-----------|-----------|
##         high.school |      8436 |      1028 |      9464 | 
##                     |     0.891 |     0.109 |     0.232 | 
## --------------------|-----------|-----------|-----------|
##          illiterate |        14 |         4 |        18 | 
##                     |     0.778 |     0.222 |     0.000 | 
## --------------------|-----------|-----------|-----------|
## professional.course |      4631 |       594 |      5225 | 
##                     |     0.886 |     0.114 |     0.128 | 
## --------------------|-----------|-----------|-----------|
##   university.degree |     10442 |      1654 |     12096 | 
##                     |     0.863 |     0.137 |     0.297 | 
## --------------------|-----------|-----------|-----------|
##             unknown |      1362 |       234 |      1596 | 
##                     |     0.853 |     0.147 |     0.039 | 
## --------------------|-----------|-----------|-----------|
##        Column Total |     36193 |      4594 |     40787 | 
## --------------------|-----------|-----------|-----------|
## 
## 

The illiterate category does not have enough observations to be statistically meaningful: we can’t characterise illiterate people from a pool of only 18 individuals. Hence, these rows will be deleted from the dataset.

Among the 1,596 rows containing the “unknown” value, 234 subscribed to a term deposit, around 5% of the total group of subscribers. Since the dependent variable is very imbalanced, we cannot afford to discard those rows. As this category has the highest relative frequency of “y = 1” (14.7%), we’ll merge it into the “university.degree” level, which has the second highest (13.7%).
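Merging one factor level into another can be done through `levels<-`: assigning a duplicated level name makes R combine the two levels (a sketch on a synthetic factor, with the level names taken from the table above):

```r
# Fold the "unknown" level into "university.degree".
education = factor(c("basic.4y", "unknown", "university.degree"))
levels(education)[levels(education) == "unknown"] = "university.degree"
table(education)   # basic.4y: 1, university.degree: 2
```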

There seems to be a positive correlation between the number of years of education and the odds of subscribing to a term deposit.

1.2.1.5 Default

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | y 
##      default |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     28182 |      4155 |     32337 | 
##              |     0.872 |     0.128 |     0.793 | 
## -------------|-----------|-----------|-----------|
##      unknown |      7994 |       435 |      8429 | 
##              |     0.948 |     0.052 |     0.207 | 
## -------------|-----------|-----------|-----------|
##          yes |         3 |         0 |         3 | 
##              |     1.000 |     0.000 |     0.000 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 

This feature is certainly not usable: only 3 individuals replied “yes” to the question “Do you have credit in default?”. People either answer “no” (79.3%) or don’t reply at all (20.7%), which gives zero information in our case. This variable is removed from the dataset.

1.2.1.6 Housing

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | y 
##      housing |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     16416 |      2003 |     18419 | 
##              |     0.891 |     0.109 |     0.452 | 
## -------------|-----------|-----------|-----------|
##      unknown |       877 |       107 |       984 | 
##              |     0.891 |     0.109 |     0.024 | 
## -------------|-----------|-----------|-----------|
##          yes |     18886 |      2480 |     21366 | 
##              |     0.884 |     0.116 |     0.524 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 
## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$housing and bank_data$y
## X-squared = 5.4627, df = 2, p-value = 0.06513

The p-value of the Chi-squared test is 0.065, above the usual 0.05 threshold. So, at the 5% significance level, we cannot reject the hypothesis of no association between the dependent variable y and the feature housing. We’re removing it from the dataset.
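The test above was produced by `CrossTable`; the same Pearson statistic can be reproduced directly from the counts in the table with base R’s `chisq.test` (no continuity correction is applied for tables larger than 2×2):

```r
# Rebuild the housing x y contingency table from the counts above
# and rerun Pearson's chi-squared test.
housing_y = matrix(c(16416, 2003,
                       877,  107,
                     18886, 2480),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(housing = c("no", "unknown", "yes"),
                                   y = c("0", "1")))
chisq.test(housing_y)   # X-squared = 5.4627, df = 2, p-value = 0.06513
```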

1.2.1.7 Loan

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | y 
##         loan |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     29799 |      3806 |     33605 | 
##              |     0.887 |     0.113 |     0.824 | 
## -------------|-----------|-----------|-----------|
##      unknown |       877 |       107 |       984 | 
##              |     0.891 |     0.109 |     0.024 | 
## -------------|-----------|-----------|-----------|
##          yes |      5503 |       677 |      6180 | 
##              |     0.890 |     0.110 |     0.152 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 
## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$loan and bank_data$y
## X-squared = 0.86841, df = 2, p-value = 0.6478

The p-value of the Chi-squared test is 0.648, far above any usual significance threshold. Again, we cannot reject the hypothesis of no association between the dependent variable y and the feature loan, so we’re also removing it from the dataset.

1.2.1.8 Contact

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | y 
##      contact |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     cellular |     22098 |      3815 |     25913 | 
##              |     0.853 |     0.147 |     0.636 | 
## -------------|-----------|-----------|-----------|
##    telephone |     14081 |       775 |     14856 | 
##              |     0.948 |     0.052 |     0.364 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 

This feature is really interesting: 14.7% of the clients contacted on a cell phone subscribed to a term deposit, while only 5.2% of those contacted on a landline did. There is no obvious reason explaining this gap.

1.2.1.9 Month

Recoding levels makes things easier.
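The recoding chunk is not shown; one way to get sortable labels like “(03)mar”, as they appear in the table below, is a named lookup vector (a sketch on three synthetic months):

```r
# Map raw month abbreviations to "(MM)mmm" labels so the factor
# levels sort chronologically instead of alphabetically.
month = c("mar", "may", "dec")            # synthetic sample
month_map = c(mar = "(03)mar", apr = "(04)apr", may = "(05)may",
              jun = "(06)jun", jul = "(07)jul", aug = "(08)aug",
              sep = "(09)sep", oct = "(10)oct", nov = "(11)nov",
              dec = "(12)dec")
month = factor(month_map[month], levels = month_map)
as.character(month)   # "(03)mar" "(05)may" "(12)dec"
```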

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | y 
##        month |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      (03)mar |       267 |       274 |       541 | 
##              |     0.494 |     0.506 |     0.013 | 
## -------------|-----------|-----------|-----------|
##      (04)apr |      2082 |       536 |      2618 | 
##              |     0.795 |     0.205 |     0.064 | 
## -------------|-----------|-----------|-----------|
##      (05)may |     12734 |       882 |     13616 | 
##              |     0.935 |     0.065 |     0.334 | 
## -------------|-----------|-----------|-----------|
##      (06)jun |      4697 |       548 |      5245 | 
##              |     0.896 |     0.104 |     0.129 | 
## -------------|-----------|-----------|-----------|
##      (07)jul |      6471 |       642 |      7113 | 
##              |     0.910 |     0.090 |     0.174 | 
## -------------|-----------|-----------|-----------|
##      (08)aug |      5459 |       644 |      6103 | 
##              |     0.894 |     0.106 |     0.150 | 
## -------------|-----------|-----------|-----------|
##      (09)sep |       309 |       253 |       562 | 
##              |     0.550 |     0.450 |     0.014 | 
## -------------|-----------|-----------|-----------|
##      (10)oct |       396 |       311 |       707 | 
##              |     0.560 |     0.440 |     0.017 | 
## -------------|-----------|-----------|-----------|
##      (11)nov |      3672 |       412 |      4084 | 
##              |     0.899 |     0.101 |     0.100 | 
## -------------|-----------|-----------|-----------|
##      (12)dec |        92 |        88 |       180 | 
##              |     0.511 |     0.489 |     0.004 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 

First of all, we can notice that no contacts were made during January and February. The highest spike occurs in May, with 33.4% of all contacts, but it has the worst ratio of subscribers among persons contacted (6.5%). Every month with a very low contact frequency (March, September, October and December) shows very good results (between 44% and 51% of subscribers). Except maybe for December, there are enough observations to conclude this isn’t pure luck, so this feature will probably be very important in the models.

1.2.1.10 Day of the week

For the same reason as before, we’re recoding the days of the week.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40769 
## 
##  
##              | y 
##  day_of_week |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      (01)mon |      7578 |       841 |      8419 | 
##              |     0.900 |     0.100 |     0.207 | 
## -------------|-----------|-----------|-----------|
##      (02)tue |      7056 |       945 |      8001 | 
##              |     0.882 |     0.118 |     0.196 | 
## -------------|-----------|-----------|-----------|
##      (03)wed |      7116 |       934 |      8050 | 
##              |     0.884 |     0.116 |     0.197 | 
## -------------|-----------|-----------|-----------|
##      (04)thu |      7493 |      1031 |      8524 | 
##              |     0.879 |     0.121 |     0.209 | 
## -------------|-----------|-----------|-----------|
##      (05)fri |      6936 |       839 |      7775 | 
##              |     0.892 |     0.108 |     0.191 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36179 |      4590 |     40769 | 
## -------------|-----------|-----------|-----------|
## 
## 

Calls aren’t made at weekends. While calls are spread fairly evenly across weekdays, Thursdays tend to show better results (12.1% of subscribers among calls made that day), unlike Mondays with only 10.0% of successful calls. However, those differences are small, which makes this feature not that important. It would’ve been interesting to see the attitude of responders to weekend calls.

1.2.1.11 Duration

Since the goal is to find the candidates with the best odds of subscribing to a term deposit, and the call duration obviously cannot be known before the call is made, this feature is removed, as the dataset documentation recommends.

1.2.1.12 Campaign

Calling the same person more than ten times during a single marketing campaign seems excessive, so we’ll treat those observations as outliers, even if marketing harassment is a real thing. In any case, the chart shows that harassment isn’t working at all.
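The outlier filter itself is a simple logical subset; a sketch on synthetic contact counts (on the real data, keeping `campaign <= 10` leaves the 39,912 observations shown in the table below):

```r
# Keep only observations with at most ten contacts in this campaign.
campaign = c(1, 2, 11, 5, 56)   # synthetic contact counts
keep = campaign <= 10
sum(keep)                        # 4 of 5 observations kept
```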

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  39912 
## 
##  
##              | y 
##     campaign |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |     15176 |      2268 |     17444 | 
##              |     0.870 |     0.130 |     0.437 | 
## -------------|-----------|-----------|-----------|
##            2 |      9280 |      1205 |     10485 | 
##              |     0.885 |     0.115 |     0.263 | 
## -------------|-----------|-----------|-----------|
##            3 |      4722 |       569 |      5291 | 
##              |     0.892 |     0.108 |     0.133 | 
## -------------|-----------|-----------|-----------|
##            4 |      2380 |       246 |      2626 | 
##              |     0.906 |     0.094 |     0.066 | 
## -------------|-----------|-----------|-----------|
##            5 |      1466 |       120 |      1586 | 
##              |     0.924 |     0.076 |     0.040 | 
## -------------|-----------|-----------|-----------|
##            6 |       892 |        74 |       966 | 
##              |     0.923 |     0.077 |     0.024 | 
## -------------|-----------|-----------|-----------|
##            7 |       583 |        38 |       621 | 
##              |     0.939 |     0.061 |     0.016 | 
## -------------|-----------|-----------|-----------|
##            8 |       378 |        16 |       394 | 
##              |     0.959 |     0.041 |     0.010 | 
## -------------|-----------|-----------|-----------|
##            9 |       260 |        17 |       277 | 
##              |     0.939 |     0.061 |     0.007 | 
## -------------|-----------|-----------|-----------|
##           10 |       211 |        11 |       222 | 
##              |     0.950 |     0.050 |     0.006 | 
## -------------|-----------|-----------|-----------|
## Column Total |     35348 |      4564 |     39912 | 
## -------------|-----------|-----------|-----------|
## 
## 

Let’s extract the row proportions when y = 1.


The subscription rate declines slowly as the number of contacts increases, but there’s no obvious break point that would suggest a natural binning, and binning this variable would lose a lot of information. Hence, it’ll be kept as a categorical variable with 10 levels.

All the variables are kept as character or numeric for now; we’ll convert them to factors later if needed.

1.2.1.13 Pdays

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
##    15    25    59   431   116    46   403    60    17    63    52    27 
##    12    13    14    15    16    17    18    19    20    21    22    25 
##    58    35    19    24    10     8     7     3     1     2     3     1 
##    26    27   999 
##     1     1 38425

This is the number of days that have passed since the client was last contacted during a previous campaign; the value 999 means the client wasn’t previously contacted. Since the actual counts are sparse, let’s turn it into a dummy variable (previously contacted or not).
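A minimal sketch of that recoding, assuming a `pdays` vector like the one tabulated above:

```r
# 999 -> never contacted (0), anything else -> previously contacted (1)
pdays <- c(3, 999, 6, 999, 0)
pdays_dummy <- ifelse(pdays == 999, 0L, 1L)
table(pdays_dummy)
```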

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  39912 
## 
##  
##              | y 
##  pdays_dummy |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |     34817 |      3608 |     38425 | 
##              |     0.906 |     0.094 |     0.963 | 
## -------------|-----------|-----------|-----------|
##            1 |       531 |       956 |      1487 | 
##              |     0.357 |     0.643 |     0.037 | 
## -------------|-----------|-----------|-----------|
## Column Total |     35348 |      4564 |     39912 | 
## -------------|-----------|-----------|-----------|
## 
## 

Recontacting a client after a previous campaign seems to greatly increase the odds of subscription (64.3% versus 9.4%).

1.2.1.15 Poutcome

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  39912 
## 
##  
##              | y 
##     poutcome |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      failure |      3616 |       595 |      4211 | 
##              |     0.859 |     0.141 |     0.106 | 
## -------------|-----------|-----------|-----------|
##  nonexistent |     31268 |      3085 |     34353 | 
##              |     0.910 |     0.090 |     0.861 | 
## -------------|-----------|-----------|-----------|
##      success |       464 |       884 |      1348 | 
##              |     0.344 |     0.656 |     0.034 | 
## -------------|-----------|-----------|-----------|
## Column Total |     35348 |      4564 |     39912 | 
## -------------|-----------|-----------|-----------|
## 
## 

Among clients who subscribed to a term deposit after a previous campaign, 65.6% accepted to do it again. Even clients who refused before are still more likely to accept (14.1%) than people who were never contacted (9.0%). So even when the previous campaign was a failure, recontacting clients is worthwhile.


So far, the social and economic context attributes remain. Since these indicators are highly correlated with one another, we’ll study them jointly (bivariate analysis).

1.2.2 Bivariate analysis

1.2.2.1 Employment variation rate, Consumer price index, Consumer confidence index, Euribor 3 months rate, Number of employees

These five continuous variables are social and economic indicators, which are expected to be highly correlated. Let’s compute the correlation matrix.
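A toy sketch of this step (the real indicator values aren’t reproduced here; the series below are simulated to be correlated, mimicking the situation in the data):

```r
set.seed(1)
n <- 200
# Simulated stand-ins for the five economic indicators
emp.var.rate   <- rnorm(n)
euribor3m      <- 0.97 * emp.var.rate + rnorm(n, sd = 0.2)
nr.employed    <- 0.95 * euribor3m    + rnorm(n, sd = 0.2)
cons.price.idx <- 0.80 * emp.var.rate + rnorm(n, sd = 0.5)
cons.conf.idx  <- rnorm(n)
econ <- data.frame(emp.var.rate, cons.price.idx, cons.conf.idx,
                   euribor3m, nr.employed)
round(cor(econ), 2)  # several pairs correlate above 0.90
```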

As expected, three pairs show a correlation coefficient above 0.90, which is far too high. Our indicators are too strongly correlated and share redundant information. Let’s figure out which variable(s) can be removed to lighten the correlation matrix.

emp.var.rate isn’t meaningful on its own, so we remove it to soften the correlations among these five variables.

Even though a high correlation remains between two variables, euribor3m and nr.employed (0.94), we’re keeping both features. This correlation is arguably spurious: the employment level isn’t reactive to the Euribor rate.

Let’s continue the bivariate analysis by crossing our features with our predicted variable.

First, we’ll split character features from non-character features (numeric or factor) so we can use the appropriate tests: ANOVA for numeric features and the chi-squared test (Cramér’s V coefficient) for discrete variables.

Let’s start with some ANOVA.
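Each ANOVA below follows the same pattern; here is a self-contained sketch on toy data (the report applies this to each remaining economic indicator against y):

```r
set.seed(42)
y <- factor(rep(c(0, 1), each = 100))   # binary response
x <- c(rnorm(100, 0), rnorm(100, 1))    # group means differ by 1
summary(aov(x ~ y))                     # small Pr(>F): the means differ
```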

##                Df Sum Sq Mean Sq F value Pr(>F)    
## y               1    238  238.41   724.1 <2e-16 ***
## Residuals   39910  13140    0.33                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                Df Sum Sq Mean Sq F value Pr(>F)    
## y               1   2598  2598.1   120.5 <2e-16 ***
## Residuals   39910 860687    21.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                Df Sum Sq Mean Sq F value Pr(>F)    
## y               1  11401   11401    4154 <2e-16 ***
## Residuals   39910 109537       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## y               1  26374821 26374821    5738 <2e-16 ***
## Residuals   39910 183451532     4597                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All four remaining variables show means that are statistically contrasted across the two levels of the y response. This is good.

1.2.2.2 Cramer’s V against y (3 columns)

Cramér’s V is a standardized version of the chi-squared test statistic, calculated as follows: \[V = \sqrt{\frac{\chi^2}{n \cdot (\min (r, c) -1)}}\] where \(n\) is the sample size, \(r\) the number of rows and \(c\) the number of columns of the contingency table. Since one of the two inputs of the contingency table is our predicted variable, which has only two levels, \(\min (r,c) - 1\) is necessarily equal to 1. So in this case the formula is simply:

\[V = \sqrt{\frac{\chi^2}{n}}\]

A high V value indicates a strong dependency between the feature and the y variable.
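A sketch of the computation in base R (the helper’s name is an assumption, not the report’s code):

```r
# Cramér's V from a contingency table; with a binary y, min(r, c) - 1 = 1
# and V reduces to sqrt(chi-squared / n).
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}
tab <- matrix(c(30, 10, 10, 30), nrow = 2)  # toy 2x2 contingency table
cramers_v(tab)  # 0.5
```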

1.2.2.3 Cramer’s V matrix among features

This time, both inputs can have more than two levels, so the default formula is used: \[V = \sqrt{\frac{\chi^2}{n \cdot (\min (r, c) -1)}}\]

Apart from the pdays_dummy/poutcome pair, the standardized chi-squared statistic doesn’t show any strong dependency between features. We’re arbitrarily keeping both variables.


So far, we’ve:

  • removed four variables : default (lack of variability), housing (lack of information), loan (lack of information) and emp.var.rate (lack of significance),
  • binned two variables : pdays and previous,
  • capped one variable : campaign, because it had outliers,
  • detected but kept two correlation issues : nr.employed with euribor3m and poutcome with pdays_dummy.

2 MODELS

2.1 Data preparation

Now that we’ve selected our features, it is useful to convert every character variable to a factor, which eases interpretation of the logistic regressions. This step also sets the reference level used for comparisons (reference group).
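A minimal sketch of that conversion (column names and reference levels are illustrative, not the report’s actual code):

```r
bank <- data.frame(job = c("admin.", "retired", "student", "admin."),
                   y   = c("0", "1", "0", "1"),
                   stringsAsFactors = FALSE)
chr <- vapply(bank, is.character, logical(1))   # find character columns
bank[chr] <- lapply(bank[chr], factor)          # convert them to factors
bank$job <- relevel(bank$job, ref = "admin.")   # set the reference group
levels(bank$job)
```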

Our dataset is ready; the only remaining step is splitting the data into training and validation sets. Since we’re going to test different sampling methods, the training set will change between methods, but the test set will remain the same so the models can be compared. Sampling will be done with the caret package, except for SMOTE, which will be done with the DMwR package.
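A base-R approximation of a stratified split (the report itself uses caret’s createDataPartition; the 80/20 ratio here is an assumption):

```r
set.seed(123)
y <- factor(rep(c(0, 1), c(900, 100)))  # toy imbalanced response
# sample 80% within each class so both sets keep the class balance
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, round(0.8 * length(i)))))
train_y <- y[idx]
test_y  <- y[-idx]
prop.table(table(train_y))  # class balance preserved in both sets
```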

2.2 Normal sampling (unbalanced training set)

This is the default sampling, no change needs to be made.

Let’s start with logistic regression, then decision trees, then random forests (through the ranger package).

2.2.1 Logistic regression

2.2.1.1 Results

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_ns)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2102  -0.3959  -0.3254  -0.2542   2.8461  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  58.3815416 24.7913690   2.355 0.018527 *  
## agemid                       -0.1596297  0.0565249  -2.824 0.004742 ** 
## agehigh                       0.0638445  0.1330406   0.480 0.631308    
## jobhousemaid                 -0.0887968  0.1823224  -0.487 0.626236    
## jobservices                  -0.1204328  0.1411130  -0.853 0.393410    
## jobadmin.                     0.0564314  0.1264671   0.446 0.655443    
## jobblue-collar               -0.1505559  0.1326503  -1.135 0.256382    
## jobtechnician                 0.0413816  0.1319284   0.314 0.753774    
## jobretired                    0.1857167  0.1593447   1.166 0.243815    
## jobmanagement                -0.0018819  0.1431125  -0.013 0.989508    
## jobself-employed             -0.0235987  0.1635393  -0.144 0.885264    
## jobentrepreneur               0.0544228  0.1632784   0.333 0.738898    
## jobstudent                    0.2336432  0.1567528   1.491 0.136087    
## maritalmarried                0.0249895  0.0510480   0.490 0.624467    
## maritaldivorced               0.0574781  0.0753998   0.762 0.445875    
## educationhigh.school          0.0517663  0.0896447   0.577 0.563628    
## educationbasic.6y             0.1177217  0.1166024   1.010 0.312687    
## educationbasic.9y            -0.0137289  0.0924755  -0.148 0.881980    
## educationprofessional.course  0.0848380  0.0988428   0.858 0.390720    
## educationuniversity.degree    0.0977261  0.0873560   1.119 0.263264    
## contactcellular               0.5115313  0.0686044   7.456 8.90e-14 ***
## month(05)may                 -1.4730158  0.1222787 -12.046  < 2e-16 ***
## month(06)jun                 -0.5294714  0.1473883  -3.592 0.000328 ***
## month(07)jul                 -0.6060177  0.1349348  -4.491 7.08e-06 ***
## month(08)aug                 -0.9763408  0.1346840  -7.249 4.19e-13 ***
## month(10)oct                 -1.0005521  0.1549178  -6.459 1.06e-10 ***
## month(11)nov                 -1.1413269  0.1317422  -8.663  < 2e-16 ***
## month(12)dec                 -0.7347432  0.2157858  -3.405 0.000662 ***
## month(04)apr                 -0.7560740  0.1332616  -5.674 1.40e-08 ***
## month(09)sep                 -1.3377130  0.1620985  -8.252  < 2e-16 ***
## day_of_week(02)tue            0.2425016  0.0648050   3.742 0.000183 ***
## day_of_week(03)wed            0.3439843  0.0643322   5.347 8.94e-08 ***
## day_of_week(04)thu            0.2825023  0.0626887   4.506 6.59e-06 ***
## day_of_week(05)fri            0.2342640  0.0650205   3.603 0.000315 ***
## campaign2                    -0.0294343  0.0487748  -0.603 0.546194    
## campaign3                     0.0804110  0.0625806   1.285 0.198821    
## campaign4                    -0.0050193  0.0868835  -0.058 0.953931    
## campaign5                    -0.2364392  0.1181480  -2.001 0.045369 *  
## campaign6                    -0.2176382  0.1514567  -1.437 0.150728    
## campaign7                    -0.4332961  0.2061754  -2.102 0.035589 *  
## campaign8                    -0.5788286  0.2818325  -2.054 0.039995 *  
## campaign9                    -0.2435047  0.3313750  -0.735 0.462442    
## campaign10                   -0.1347225  0.3359390  -0.401 0.688396    
## previous1                     0.2579275  0.2251453   1.146 0.251959    
## previous2+                    0.1679809  0.2724129   0.617 0.537472    
## poutcomefailure              -0.7651163  0.2204643  -3.470 0.000520 ***
## poutcomesuccess                      NA         NA      NA       NA    
## cons.price.idx               -0.0281799  0.1368159  -0.206 0.836815    
## cons.conf.idx                 0.0193871  0.0078440   2.472 0.013452 *  
## euribor3m                    -0.0007667  0.1333870  -0.006 0.995414    
## nr.employed                  -0.0110117  0.0025492  -4.320 1.56e-05 ***
## pdays_dummy1                  1.1447030  0.2241324   5.107 3.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22707  on 31930  degrees of freedom
## Residual deviance: 17776  on 31880  degrees of freedom
## AIC: 17878
## 
## Number of Fisher Scoring iterations: 6

2.2.1.2 Features importance

2.2.1.3 Predicted scores

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading

2.2.1.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25581  1587
##          1  2698  2065
##                                          
##                Accuracy : 0.8658         
##                  95% CI : (0.862, 0.8695)
##     No Information Rate : 0.8856         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.4151         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.56544        
##             Specificity : 0.90459        
##          Pos Pred Value : 0.43355        
##          Neg Pred Value : 0.94159        
##              Prevalence : 0.11437        
##          Detection Rate : 0.06467        
##    Detection Prevalence : 0.14917        
##       Balanced Accuracy : 0.73502        
##                                          
##        'Positive' Class : 1              
## 

2.2.1.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6399  420
##          1  670  492
##                                           
##                Accuracy : 0.8634          
##                  95% CI : (0.8557, 0.8709)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3973          
##                                           
##  Mcnemar's Test P-Value : 4.628e-14       
##                                           
##             Sensitivity : 0.53947         
##             Specificity : 0.90522         
##          Pos Pred Value : 0.42341         
##          Neg Pred Value : 0.93841         
##              Prevalence : 0.11427         
##          Detection Rate : 0.06165         
##    Detection Prevalence : 0.14560         
##       Balanced Accuracy : 0.72235         
##                                           
##        'Positive' Class : 1               
## 

2.2.2 Logistic regression 2 (simple)

2.2.2.1 Results

## 
## Call:
## glm(formula = y ~ . - job - marital - education - previous - 
##     euribor3m - cons.conf.idx - campaign, family = "binomial", 
##     data = bank_train_ns)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1363  -0.3951  -0.3346  -0.2382   2.7988  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        65.9168072  3.8313927  17.204  < 2e-16 ***
## agemid             -0.1769233  0.0487356  -3.630 0.000283 ***
## agehigh             0.1785377  0.0950352   1.879 0.060293 .  
## contactcellular     0.4059889  0.0610789   6.647 2.99e-11 ***
## month(05)may       -1.5308954  0.1143478 -13.388  < 2e-16 ***
## month(06)jun       -0.4872326  0.1247922  -3.904 9.45e-05 ***
## month(07)jul       -0.5316558  0.1246735  -4.264 2.00e-05 ***
## month(08)aug       -0.7590794  0.1211852  -6.264 3.76e-10 ***
## month(10)oct       -0.8283920  0.1417583  -5.844 5.11e-09 ***
## month(11)nov       -1.0477606  0.1256554  -8.338  < 2e-16 ***
## month(12)dec       -0.5671038  0.2090464  -2.713 0.006671 ** 
## month(04)apr       -0.8061365  0.1194677  -6.748 1.50e-11 ***
## month(09)sep       -1.1521548  0.1480853  -7.780 7.23e-15 ***
## day_of_week(02)tue  0.2561945  0.0643412   3.982 6.84e-05 ***
## day_of_week(03)wed  0.3487747  0.0640040   5.449 5.06e-08 ***
## day_of_week(04)thu  0.2890088  0.0621679   4.649 3.34e-06 ***
## day_of_week(05)fri  0.2376569  0.0647390   3.671 0.000242 ***
## poutcomefailure    -0.5213575  0.0631761  -8.252  < 2e-16 ***
## poutcomesuccess     0.2911045  0.2184294   1.333 0.182625    
## cons.price.idx     -0.0954891  0.0391932  -2.436 0.014836 *  
## nr.employed        -0.0113909  0.0003334 -34.164  < 2e-16 ***
## pdays_dummy1        1.1133871  0.2065204   5.391 7.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22707  on 31930  degrees of freedom
## Residual deviance: 17845  on 31909  degrees of freedom
## AIC: 17889
## 
## Number of Fisher Scoring iterations: 6

2.2.2.2 Features importance

2.2.2.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25535  1571
##          1  2744  2081
##                                           
##                Accuracy : 0.8649          
##                  95% CI : (0.8611, 0.8686)
##     No Information Rate : 0.8856          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4148          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.56982         
##             Specificity : 0.90297         
##          Pos Pred Value : 0.43130         
##          Neg Pred Value : 0.94204         
##              Prevalence : 0.11437         
##          Detection Rate : 0.06517         
##    Detection Prevalence : 0.15111         
##       Balanced Accuracy : 0.73640         
##                                           
##        'Positive' Class : 1               
## 

2.2.2.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6386  417
##          1  683  495
##                                           
##                Accuracy : 0.8622          
##                  95% CI : (0.8544, 0.8697)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3959          
##                                           
##  Mcnemar's Test P-Value : 1.349e-15       
##                                           
##             Sensitivity : 0.54276         
##             Specificity : 0.90338         
##          Pos Pred Value : 0.42020         
##          Neg Pred Value : 0.93870         
##              Prevalence : 0.11427         
##          Detection Rate : 0.06202         
##    Detection Prevalence : 0.14760         
##       Balanced Accuracy : 0.72307         
##                                           
##        'Positive' Class : 1               
## 

2.2.3 Decision Tree

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

2.2.3.1 Features importance

2.2.3.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 26477  1979
##          1  1802  1673
##                                          
##                Accuracy : 0.8816         
##                  95% CI : (0.878, 0.8851)
##     No Information Rate : 0.8856         
##     P-Value [Acc > NIR] : 0.988329       
##                                          
##                   Kappa : 0.4029         
##                                          
##  Mcnemar's Test P-Value : 0.004206       
##                                          
##             Sensitivity : 0.45811        
##             Specificity : 0.93628        
##          Pos Pred Value : 0.48144        
##          Neg Pred Value : 0.93045        
##              Prevalence : 0.11437        
##          Detection Rate : 0.05239        
##    Detection Prevalence : 0.10883        
##       Balanced Accuracy : 0.69719        
##                                          
##        'Positive' Class : 1              
## 

2.2.3.5 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6619  520
##          1  450  392
##                                           
##                Accuracy : 0.8785          
##                  95% CI : (0.8711, 0.8856)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 0.97953         
##                                           
##                   Kappa : 0.3788          
##                                           
##  Mcnemar's Test P-Value : 0.02673         
##                                           
##             Sensitivity : 0.42982         
##             Specificity : 0.93634         
##          Pos Pred Value : 0.46556         
##          Neg Pred Value : 0.92716         
##              Prevalence : 0.11427         
##          Detection Rate : 0.04912         
##    Detection Prevalence : 0.10550         
##       Balanced Accuracy : 0.68308         
##                                           
##        'Positive' Class : 1               
## 

2.2.4 Decision Tree 2

2.2.4.1 Features importance

2.2.4.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25928  1682
##          1  2351  1970
##                                         
##                Accuracy : 0.8737        
##                  95% CI : (0.87, 0.8773)
##     No Information Rate : 0.8856        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.4226        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.5394        
##             Specificity : 0.9169        
##          Pos Pred Value : 0.4559        
##          Neg Pred Value : 0.9391        
##              Prevalence : 0.1144        
##          Detection Rate : 0.0617        
##    Detection Prevalence : 0.1353        
##       Balanced Accuracy : 0.7281        
##                                         
##        'Positive' Class : 1             
## 

2.2.4.5 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6487  447
##          1  582  465
##                                           
##                Accuracy : 0.8711          
##                  95% CI : (0.8635, 0.8783)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4016          
##                                           
##  Mcnemar's Test P-Value : 2.95e-05        
##                                           
##             Sensitivity : 0.50987         
##             Specificity : 0.91767         
##          Pos Pred Value : 0.44413         
##          Neg Pred Value : 0.93554         
##              Prevalence : 0.11427         
##          Detection Rate : 0.05826         
##    Detection Prevalence : 0.13119         
##       Balanced Accuracy : 0.71377         
##                                           
##        'Positive' Class : 1               
## 

2.2.5 Random Forest (ranger)

2.2.5.1 Results

## Ranger result
## 
## Call:
##  ranger(y ~ ., data = bank_train_ns, num.trees = 1000, importance = "impurity",      write.forest = T, probability = T) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      31931 
## Number of independent variables:  15 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.07761157

2.2.5.2 Features importance

2.2.5.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24945   858
##          1  3334  2794
##                                          
##                Accuracy : 0.8687         
##                  95% CI : (0.865, 0.8724)
##     No Information Rate : 0.8856         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.4997         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7651         
##             Specificity : 0.8821         
##          Pos Pred Value : 0.4559         
##          Neg Pred Value : 0.9667         
##              Prevalence : 0.1144         
##          Detection Rate : 0.0875         
##    Detection Prevalence : 0.1919         
##       Balanced Accuracy : 0.8236         
##                                          
##        'Positive' Class : 1              
## 

2.2.5.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6402  402
##          1  667  510
##                                           
##                Accuracy : 0.8661          
##                  95% CI : (0.8584, 0.8735)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4126          
##                                           
##  Mcnemar's Test P-Value : 6.776e-16       
##                                           
##             Sensitivity : 0.5592          
##             Specificity : 0.9056          
##          Pos Pred Value : 0.4333          
##          Neg Pred Value : 0.9409          
##              Prevalence : 0.1143          
##          Detection Rate : 0.0639          
##    Detection Prevalence : 0.1475          
##       Balanced Accuracy : 0.7324          
##                                           
##        'Positive' Class : 1               
## 

2.2.6 Model selection

2.3 Down sampling (balanced training set)
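Down sampling balances the training set by discarding majority-class rows. The report relies on caret for this; a base-R sketch of the idea:

```r
set.seed(7)
y <- factor(rep(c(0, 1), c(900, 100)))
# keep only as many majority-class (0) rows as there are minority (1) rows
keep0 <- sample(which(y == "0"), sum(y == "1"))
idx <- c(keep0, which(y == "1"))
table(y[idx])  # balanced: 100 of each class
```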

2.3.1 Logistic regression

2.3.1.1 Results

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_ds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7700  -0.8696  -0.1514   0.8139   2.0738  
## 
## Coefficients: (1 not defined because of singularities)
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  88.912763  37.071219   2.398 0.016465 *  
## agemid                       -0.185724   0.077777  -2.388 0.016944 *  
## agehigh                      -0.342582   0.211457  -1.620 0.105210    
## jobhousemaid                  0.090610   0.260078   0.348 0.727544    
## jobservices                   0.066291   0.202461   0.327 0.743344    
## jobadmin.                     0.167694   0.186331   0.900 0.368130    
## jobblue-collar                0.018984   0.192145   0.099 0.921297    
## jobtechnician                 0.232040   0.193374   1.200 0.230157    
## jobretired                    0.637648   0.236014   2.702 0.006898 ** 
## jobmanagement                 0.164375   0.208252   0.789 0.429931    
## jobself-employed              0.182019   0.233324   0.780 0.435323    
## jobentrepreneur               0.402299   0.233001   1.727 0.084240 .  
## jobstudent                    0.458142   0.245423   1.867 0.061937 .  
## maritalmarried                0.074860   0.069854   1.072 0.283870    
## maritaldivorced               0.088578   0.102645   0.863 0.388165    
## educationhigh.school          0.076348   0.123145   0.620 0.535267    
## educationbasic.6y             0.158329   0.152674   1.037 0.299718    
## educationbasic.9y             0.119823   0.123071   0.974 0.330252    
## educationprofessional.course  0.077591   0.137371   0.565 0.572193    
## educationuniversity.degree    0.166033   0.121409   1.368 0.171452    
## contactcellular               0.334683   0.101432   3.300 0.000968 ***
## month(05)may                 -1.468083   0.221400  -6.631 3.34e-11 ***
## month(06)jun                 -0.422951   0.277630  -1.523 0.127650    
## month(07)jul                 -0.479768   0.243097  -1.974 0.048431 *  
## month(08)aug                 -0.786387   0.251893  -3.122 0.001797 ** 
## month(10)oct                 -0.377485   0.291274  -1.296 0.194982    
## month(11)nov                 -1.160957   0.229159  -5.066 4.06e-07 ***
## month(12)dec                 -0.364799   0.446958  -0.816 0.414397    
## month(04)apr                 -0.701019   0.236928  -2.959 0.003089 ** 
## month(09)sep                 -1.230486   0.298092  -4.128 3.66e-05 ***
## day_of_week(02)tue            0.211085   0.088848   2.376 0.017511 *  
## day_of_week(03)wed            0.327703   0.087542   3.743 0.000182 ***
## day_of_week(04)thu            0.258029   0.086524   2.982 0.002862 ** 
## day_of_week(05)fri            0.215252   0.088210   2.440 0.014678 *  
## campaign2                    -0.039130   0.068827  -0.569 0.569673    
## campaign3                     0.019399   0.085767   0.226 0.821059    
## campaign4                    -0.008098   0.114729  -0.071 0.943727    
## campaign5                    -0.282353   0.153212  -1.843 0.065344 .  
## campaign6                    -0.147349   0.193541  -0.761 0.446458    
## campaign7                    -0.269657   0.256845  -1.050 0.293772    
## campaign8                    -0.395457   0.341930  -1.157 0.247458    
## campaign9                    -0.361462   0.377399  -0.958 0.338178    
## campaign10                   -0.412008   0.402783  -1.023 0.306354    
## previous1                    -0.140060   0.513467  -0.273 0.785028    
## previous2+                   -0.279052   0.589601  -0.473 0.636007    
## poutcomefailure              -0.312718   0.509859  -0.613 0.539650    
## poutcomesuccess                     NA         NA      NA       NA    
## cons.price.idx               -0.221133   0.206868  -1.069 0.285089    
## cons.conf.idx                 0.002998   0.013625   0.220 0.825822    
## euribor3m                     0.119074   0.197985   0.601 0.547554    
## nr.employed                  -0.013277   0.003836  -3.461 0.000538 ***
## pdays_dummy1                  1.465256   0.509746   2.874 0.004047 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10125.5  on 7303  degrees of freedom
## Residual deviance:  7813.7  on 7253  degrees of freedom
## AIC: 7915.7
## 
## Number of Fisher Scoring iterations: 5
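The coefficients in the summary above are on the log-odds scale; exponentiating them yields odds ratios, which are usually easier to interpret (e.g. `jobretired` at 0.638 corresponds to roughly 1.9× the subscription odds of the baseline job). A sketch on a toy fit:

```r
set.seed(1)
# toy binary-outcome data; the real model is y ~ . on bank_train_ds
d   <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.5 * d$x))

fit <- glm(y ~ x, family = "binomial", data = d)

# log-odds coefficients -> odds ratios with 95% Wald intervals
or <- exp(cbind(OR = coef(fit), confint.default(fit)))
or
```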

2.3.1.2 Features importance
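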

2.3.1.3 Predicted scores

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
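This warning is a direct consequence of the aliased `poutcomesuccess` coefficient (the NA row in the summary above): once `previous` and `pdays_dummy` are in the model, the `poutcome` levels are linearly dependent on them, so the design matrix is rank-deficient and `glm` drops one column. A toy reproduction (variable names here are illustrative):

```r
set.seed(2)
# z is an exact linear combination of x1 and x2, so glm must drop it
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$z <- d$x1 + d$x2
d$y <- rbinom(50, 1, 0.5)

fit <- glm(y ~ x1 + x2 + z, family = "binomial", data = d)

names(coef(fit))[is.na(coef(fit))]  # "z": the aliased coefficient is NA
```

Calling `predict()` on such a fit raises exactly the "rank-deficient fit" warning shown above; the predictions are still usable here because the dropped column carries no additional information.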

2.3.1.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3081 1311
##          1  571 2341
##                                           
##                Accuracy : 0.7423          
##                  95% CI : (0.7321, 0.7523)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4847          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6410          
##             Specificity : 0.8436          
##          Pos Pred Value : 0.8039          
##          Neg Pred Value : 0.7015          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3205          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7423          
##                                           
##        'Positive' Class : 1               
## 

2.3.1.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5942  365
##          1 1127  547
##                                           
##                Accuracy : 0.8131          
##                  95% CI : (0.8043, 0.8216)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3229          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59978         
##             Specificity : 0.84057         
##          Pos Pred Value : 0.32676         
##          Neg Pred Value : 0.94213         
##              Prevalence : 0.11427         
##          Detection Rate : 0.06854         
##    Detection Prevalence : 0.20975         
##       Balanced Accuracy : 0.72018         
##                                           
##        'Positive' Class : 1               
## 
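The validation cut-off is presumably found by minimising a misclassification cost over a grid of thresholds rather than using the default 0.5. A sketch of such a search; the `cost_fp`/`cost_fn` values and the `cut_cost` helper are hypothetical, as the report does not state the costs used:

```r
# hypothetical unit costs for false positives / false negatives
cost_fp <- 1
cost_fn <- 5

cut_cost <- function(scores, truth, cuts = seq(0.05, 0.95, by = 0.01)) {
  costs <- sapply(cuts, function(k) {
    pred <- as.integer(scores >= k)
    sum(pred == 1 & truth == 0) * cost_fp +
      sum(pred == 0 & truth == 1) * cost_fn
  })
  cuts[which.min(costs)]
}

set.seed(7)
truth  <- rbinom(500, 1, 0.11)  # ~11% positives, as in the validation set
scores <- plogis(rnorm(500, mean = ifelse(truth == 1, 1, -1)))
best   <- cut_cost(scores, truth)
```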

2.3.2 Logistic regression 2 (simple)

2.3.2.1 Results

## 
## Call:
## glm(formula = y ~ . - marital - education - previous - campaign - 
##     cons.conf.idx - euribor3m, family = "binomial", data = bank_train_ds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8595  -0.8766  -0.1776   0.8226   2.0185  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        68.4737588  6.2919393  10.883  < 2e-16 ***
## agemid             -0.1580900  0.0731351  -2.162 0.030648 *  
## agehigh            -0.2977445  0.2061601  -1.444 0.148672    
## jobhousemaid        0.0695177  0.2576437   0.270 0.787298    
## jobservices         0.0438934  0.1996959   0.220 0.826026    
## jobadmin.           0.1754483  0.1837791   0.955 0.339745    
## jobblue-collar     -0.0003024  0.1876683  -0.002 0.998714    
## jobtechnician       0.2148402  0.1889763   1.137 0.255595    
## jobretired          0.6037687  0.2340050   2.580 0.009876 ** 
## jobmanagement       0.2052460  0.2044126   1.004 0.315341    
## jobself-employed    0.1928434  0.2314953   0.833 0.404826    
## jobentrepreneur     0.4171944  0.2316734   1.801 0.071737 .  
## jobstudent          0.4354593  0.2434884   1.788 0.073708 .  
## contactcellular     0.2927957  0.0891209   3.285 0.001018 ** 
## month(05)may       -1.4826630  0.2097093  -7.070 1.55e-12 ***
## month(06)jun       -0.4883279  0.2248794  -2.172 0.029893 *  
## month(07)jul       -0.4886509  0.2236833  -2.185 0.028921 *  
## month(08)aug       -0.7177999  0.2211024  -3.246 0.001168 ** 
## month(10)oct       -0.2655819  0.2801001  -0.948 0.343044    
## month(11)nov       -1.0934022  0.2231401  -4.900 9.58e-07 ***
## month(12)dec       -0.2698585  0.4368450  -0.618 0.536744    
## month(04)apr       -0.7460241  0.2190123  -3.406 0.000658 ***
## month(09)sep       -1.1017576  0.2824608  -3.901 9.60e-05 ***
## day_of_week(02)tue  0.2218305  0.0884197   2.509 0.012113 *  
## day_of_week(03)wed  0.3374821  0.0871291   3.873 0.000107 ***
## day_of_week(04)thu  0.2638747  0.0859392   3.070 0.002137 ** 
## day_of_week(05)fri  0.2201694  0.0878446   2.506 0.012198 *  
## poutcomefailure    -0.4705050  0.0926822  -5.077 3.84e-07 ***
## poutcomesuccess    -0.0844897  0.5020621  -0.168 0.866359    
## cons.price.idx     -0.1200751  0.0697635  -1.721 0.085219 .  
## nr.employed        -0.0110661  0.0005790 -19.114  < 2e-16 ***
## pdays_dummy1        1.3719416  0.4816075   2.849 0.004390 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10125.5  on 7303  degrees of freedom
## Residual deviance:  7828.1  on 7272  degrees of freedom
## AIC: 7892.1
## 
## Number of Fisher Scoring iterations: 5
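Dropping `marital`, `education`, `previous`, `campaign`, `cons.conf.idx` and `euribor3m` costs only 14.4 deviance for 19 fewer parameters, so AIC improves from 7915.7 to 7892.1. This kind of comparison can be done directly; a toy sketch:

```r
set.seed(3)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1))  # x2 is pure noise

full   <- glm(y ~ x1 + x2, family = "binomial", data = d)
simple <- glm(y ~ x1,      family = "binomial", data = d)

AIC(full, simple)                    # lower AIC preferred
anova(simple, full, test = "Chisq")  # deviance test of the dropped term
```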

2.3.2.2 Feature importance

2.3.2.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3067 1311
##          1  585 2341
##                                           
##                Accuracy : 0.7404          
##                  95% CI : (0.7302, 0.7504)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4808          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6410          
##             Specificity : 0.8398          
##          Pos Pred Value : 0.8001          
##          Neg Pred Value : 0.7005          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3205          
##    Detection Prevalence : 0.4006          
##       Balanced Accuracy : 0.7404          
##                                           
##        'Positive' Class : 1               
## 

2.3.2.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5915  359
##          1 1154  553
##                                          
##                Accuracy : 0.8104         
##                  95% CI : (0.8016, 0.819)
##     No Information Rate : 0.8857         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3212         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.60636        
##             Specificity : 0.83675        
##          Pos Pred Value : 0.32396        
##          Neg Pred Value : 0.94278        
##              Prevalence : 0.11427        
##          Detection Rate : 0.06929        
##    Detection Prevalence : 0.21388        
##       Balanced Accuracy : 0.72156        
##                                          
##        'Positive' Class : 1              
## 

2.3.3 Decision Tree

## Warning: labs do not fit even at cex 0.15, there may be some overplotting
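The plotting warning above comes from `rpart`'s tree rendering when a grown tree has too many splits to label legibly. A minimal `rpart` sketch on toy data (parameter values here are illustrative, not the report's settings):

```r
library(rpart)

set.seed(5)
d <- data.frame(x1 = rnorm(400), x2 = runif(400))
d$y <- factor(ifelse(d$x1 + rnorm(400, sd = 0.5) > 0, 1, 0))

# classification tree; cp controls how aggressively branches are pruned
tree <- rpart(y ~ ., data = d, method = "class",
              control = rpart.control(cp = 0.01))

printcp(tree)  # complexity table, used to pick a pruning level
prob <- predict(tree, d, type = "prob")[, "1"]
```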

2.3.3.1 Feature importance

2.3.3.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3131 1232
##          1  521 2420
##                                         
##                Accuracy : 0.76          
##                  95% CI : (0.75, 0.7698)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.52          
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.6627        
##             Specificity : 0.8573        
##          Pos Pred Value : 0.8228        
##          Neg Pred Value : 0.7176        
##              Prevalence : 0.5000        
##          Detection Rate : 0.3313        
##    Detection Prevalence : 0.4027        
##       Balanced Accuracy : 0.7600        
##                                         
##        'Positive' Class : 1             
## 

2.3.3.5 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5965  350
##          1 1104  562
##                                           
##                Accuracy : 0.8178          
##                  95% CI : (0.8092, 0.8262)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3383          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.61623         
##             Specificity : 0.84383         
##          Pos Pred Value : 0.33733         
##          Neg Pred Value : 0.94458         
##              Prevalence : 0.11427         
##          Detection Rate : 0.07042         
##    Detection Prevalence : 0.20875         
##       Balanced Accuracy : 0.73003         
##                                           
##        'Positive' Class : 1               
## 

2.3.4 Decision Tree 2

2.3.4.1 Feature importance

2.3.4.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3068 1209
##          1  584 2443
##                                           
##                Accuracy : 0.7545          
##                  95% CI : (0.7445, 0.7644)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.509           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6689          
##             Specificity : 0.8401          
##          Pos Pred Value : 0.8071          
##          Neg Pred Value : 0.7173          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3345          
##    Detection Prevalence : 0.4144          
##       Balanced Accuracy : 0.7545          
##                                           
##        'Positive' Class : 1               
## 

2.3.4.5 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5877  336
##          1 1192  576
##                                           
##                Accuracy : 0.8085          
##                  95% CI : (0.7997, 0.8171)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3286          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.63158         
##             Specificity : 0.83138         
##          Pos Pred Value : 0.32579         
##          Neg Pred Value : 0.94592         
##              Prevalence : 0.11427         
##          Detection Rate : 0.07217         
##    Detection Prevalence : 0.22153         
##       Balanced Accuracy : 0.73148         
##                                           
##        'Positive' Class : 1               
## 

2.3.5 Random Forest (ranger)

2.3.5.1 Results

## Ranger result
## 
## Call:
##  ranger(y ~ ., data = bank_train_ds, num.trees = 1000, importance = "impurity",      write.forest = T, probability = T) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      7304 
## Number of independent variables:  15 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1753084
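The call echoed above can be reproduced on toy data; with `probability = TRUE`, `ranger` grows a probability forest and reports the out-of-bag Brier score as its prediction error:

```r
library(ranger)

set.seed(9)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- factor(ifelse(d$x1 > 0, "1", "0"))

# probability forest mirroring the call above, on toy data
rf <- ranger(y ~ ., data = d, num.trees = 1000,
             importance = "impurity", write.forest = TRUE,
             probability = TRUE)

rf$prediction.error                       # OOB Brier score
sort(importance(rf), decreasing = TRUE)   # impurity-based importance
```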

2.3.5.2 Feature importance

2.3.5.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2940  527
##          1  712 3125
##                                           
##                Accuracy : 0.8304          
##                  95% CI : (0.8216, 0.8389)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6607          
##                                           
##  Mcnemar's Test P-Value : 1.72e-07        
##                                           
##             Sensitivity : 0.8557          
##             Specificity : 0.8050          
##          Pos Pred Value : 0.8144          
##          Neg Pred Value : 0.8480          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4278          
##    Detection Prevalence : 0.5253          
##       Balanced Accuracy : 0.8304          
##                                           
##        'Positive' Class : 1               
## 

2.3.5.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5977  337
##          1 1092  575
##                                           
##                Accuracy : 0.8209          
##                  95% CI : (0.8124, 0.8293)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3499          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.63048         
##             Specificity : 0.84552         
##          Pos Pred Value : 0.34493         
##          Neg Pred Value : 0.94663         
##              Prevalence : 0.11427         
##          Detection Rate : 0.07205         
##    Detection Prevalence : 0.20887         
##       Balanced Accuracy : 0.73800         
##                                           
##        'Positive' Class : 1               
## 

2.3.6 Model selection
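Comparing the validation balanced accuracies reported in the five subsections above makes the selection mechanical; the values below are copied from those confusion matrices:

```r
# validation balanced accuracy, copied from the sections above
res <- data.frame(
  model   = c("logit", "logit_simple", "tree", "tree2", "ranger"),
  bal_acc = c(0.72018, 0.72156, 0.73003, 0.73148, 0.73800)
)
res[which.max(res$bal_acc), ]  # ranger leads on balanced accuracy
```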

2.4 Case 3 - Upsampling
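Upsampling goes the other way: minority-class rows are resampled with replacement until they match the majority-class count, which is how the balanced `bank_train_us` frame used below (56,558 rows, 50/50 split) is obtained. A base-R sketch on synthetic data:

```r
set.seed(42)
# toy imbalanced training frame standing in for the real bank data
bank_train <- data.frame(
  x = rnorm(1000),
  y = factor(c(rep(0, 900), rep(1, 100)))
)

# resample every class up to the majority size, with replacement
n_max <- max(table(bank_train$y))
idx   <- unlist(lapply(split(seq_len(nrow(bank_train)), bank_train$y),
                       function(i) sample(i, n_max, replace = TRUE)))
bank_train_us <- bank_train[idx, ]

table(bank_train_us$y)  # 900 of each class
```

Note that upsampling duplicates minority rows, so train-set metrics (as in the confusion matrices below) are somewhat optimistic; the held-out validation figures are the ones to trust.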

2.4.1 Logistic regression

2.4.1.1 Results

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_us)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0217  -0.8684  -0.1270   0.8192   1.9972  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  56.6400226 13.4536911   4.210 2.55e-05 ***
## agemid                       -0.1692613  0.0282536  -5.991 2.09e-09 ***
## agehigh                       0.0937154  0.0793862   1.180 0.237802    
## jobhousemaid                 -0.1156831  0.0903583  -1.280 0.200450    
## jobservices                  -0.1264988  0.0715798  -1.767 0.077188 .  
## jobadmin.                     0.0009011  0.0656151   0.014 0.989043    
## jobblue-collar               -0.1320699  0.0674178  -1.959 0.050116 .  
## jobtechnician                 0.0652448  0.0680259   0.959 0.337500    
## jobretired                    0.2472192  0.0837829   2.951 0.003170 ** 
## jobmanagement                -0.0294588  0.0737249  -0.400 0.689467    
## jobself-employed             -0.0067750  0.0827360  -0.082 0.934737    
## jobentrepreneur               0.1233806  0.0812800   1.518 0.129022    
## jobstudent                    0.2988550  0.0878194   3.403 0.000666 ***
## maritalmarried                0.0554527  0.0254305   2.181 0.029216 *  
## maritaldivorced               0.1080766  0.0370867   2.914 0.003566 ** 
## educationhigh.school          0.1494877  0.0448192   3.335 0.000852 ***
## educationbasic.6y             0.2920990  0.0552003   5.292 1.21e-07 ***
## educationbasic.9y             0.0949928  0.0445902   2.130 0.033143 *  
## educationprofessional.course  0.1354616  0.0499531   2.712 0.006692 ** 
## educationuniversity.degree    0.1795015  0.0441679   4.064 4.82e-05 ***
## contactcellular               0.4567357  0.0369481  12.362  < 2e-16 ***
## month(05)may                 -1.6351093  0.0840180 -19.461  < 2e-16 ***
## month(06)jun                 -0.6290126  0.1047298  -6.006 1.90e-09 ***
## month(07)jul                 -0.7086905  0.0920235  -7.701 1.35e-14 ***
## month(08)aug                 -1.0586539  0.0952225 -11.118  < 2e-16 ***
## month(10)oct                 -0.4070969  0.1086347  -3.747 0.000179 ***
## month(11)nov                 -1.2659289  0.0863720 -14.657  < 2e-16 ***
## month(12)dec                 -0.6711516  0.1559816  -4.303 1.69e-05 ***
## month(04)apr                 -0.8971128  0.0890845 -10.070  < 2e-16 ***
## month(09)sep                 -1.3985292  0.1112664 -12.569  < 2e-16 ***
## day_of_week(02)tue            0.1180179  0.0322000   3.665 0.000247 ***
## day_of_week(03)wed            0.2794737  0.0316446   8.832  < 2e-16 ***
## day_of_week(04)thu            0.1860305  0.0312275   5.957 2.56e-09 ***
## day_of_week(05)fri            0.1841214  0.0321461   5.728 1.02e-08 ***
## campaign2                    -0.0132633  0.0247179  -0.537 0.591554    
## campaign3                     0.0498580  0.0312476   1.596 0.110583    
## campaign4                     0.0284793  0.0418671   0.680 0.496358    
## campaign5                    -0.1709731  0.0545138  -3.136 0.001711 ** 
## campaign6                    -0.0738881  0.0686335  -1.077 0.281677    
## campaign7                    -0.3910252  0.0921780  -4.242 2.21e-05 ***
## campaign8                    -0.6665128  0.1255954  -5.307 1.12e-07 ***
## campaign9                    -0.2039070  0.1410048  -1.446 0.148149    
## campaign10                   -0.2936664  0.1536100  -1.912 0.055906 .  
## previous1                     0.3188200  0.1672746   1.906 0.056654 .  
## previous2+                    0.1142981  0.1984781   0.576 0.564701    
## poutcomefailure              -0.8150449  0.1659705  -4.911 9.07e-07 ***
## poutcomesuccess                      NA         NA      NA       NA    
## cons.price.idx               -0.0096777  0.0753486  -0.128 0.897801    
## cons.conf.idx                 0.0197910  0.0050271   3.937 8.25e-05 ***
## euribor3m                    -0.0212993  0.0716925  -0.297 0.766396    
## nr.employed                  -0.0105780  0.0013871  -7.626 2.42e-14 ***
## pdays_dummy1                  1.1722527  0.1648295   7.112 1.14e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78406  on 56557  degrees of freedom
## Residual deviance: 60006  on 56507  degrees of freedom
## AIC: 60108
## 
## Number of Fisher Scoring iterations: 5

2.4.1.2 Feature importance

2.4.1.3 Predicted scores

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading

2.4.1.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24111 10234
##          1  4168 18045
##                                           
##                Accuracy : 0.7454          
##                  95% CI : (0.7417, 0.7489)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4907          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6381          
##             Specificity : 0.8526          
##          Pos Pred Value : 0.8124          
##          Neg Pred Value : 0.7020          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3191          
##    Detection Prevalence : 0.3927          
##       Balanced Accuracy : 0.7454          
##                                           
##        'Positive' Class : 1               
## 

2.4.1.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6016  371
##          1 1053  541
##                                          
##                Accuracy : 0.8216         
##                  95% CI : (0.813, 0.8299)
##     No Information Rate : 0.8857         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3351         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.59320        
##             Specificity : 0.85104        
##          Pos Pred Value : 0.33940        
##          Neg Pred Value : 0.94191        
##              Prevalence : 0.11427        
##          Detection Rate : 0.06779        
##    Detection Prevalence : 0.19972        
##       Balanced Accuracy : 0.72212        
##                                          
##        'Positive' Class : 1              
## 

2.4.2 Logistic regression 2 (simple)

2.4.2.1 Results

## 
## Call:
## glm(formula = y ~ . - cons.price.idx - euribor3m, family = "binomial", 
##     data = bank_train_us)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0216  -0.8687  -0.1287   0.8213   1.9931  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  58.1287988  1.0373972  56.033  < 2e-16 ***
## agemid                       -0.1703704  0.0282327  -6.035 1.59e-09 ***
## agehigh                       0.0942782  0.0793556   1.188 0.234815    
## jobhousemaid                 -0.1170609  0.0903389  -1.296 0.195046    
## jobservices                  -0.1281882  0.0715445  -1.792 0.073177 .  
## jobadmin.                    -0.0003758  0.0655878  -0.006 0.995428    
## jobblue-collar               -0.1328268  0.0673758  -1.971 0.048674 *  
## jobtechnician                 0.0628632  0.0679813   0.925 0.355115    
## jobretired                    0.2458144  0.0837528   2.935 0.003335 ** 
## jobmanagement                -0.0302889  0.0737014  -0.411 0.681096    
## jobself-employed             -0.0080796  0.0827049  -0.098 0.922177    
## jobentrepreneur               0.1238097  0.0812484   1.524 0.127548    
## jobstudent                    0.2972363  0.0877990   3.385 0.000711 ***
## maritalmarried                0.0547537  0.0254206   2.154 0.031247 *  
## maritaldivorced               0.1070324  0.0370727   2.887 0.003888 ** 
## educationhigh.school          0.1501099  0.0448107   3.350 0.000808 ***
## educationbasic.6y             0.2926070  0.0551719   5.304 1.14e-07 ***
## educationbasic.9y             0.0955512  0.0445764   2.144 0.032070 *  
## educationprofessional.course  0.1357168  0.0499521   2.717 0.006589 ** 
## educationuniversity.degree    0.1794330  0.0441525   4.064 4.83e-05 ***
## contactcellular               0.4747518  0.0334315  14.201  < 2e-16 ***
## month(05)may                 -1.6298604  0.0799383 -20.389  < 2e-16 ***
## month(06)jun                 -0.6261661  0.0845339  -7.407 1.29e-13 ***
## month(07)jul                 -0.7172775  0.0851193  -8.427  < 2e-16 ***
## month(08)aug                 -1.0544838  0.0913791 -11.540  < 2e-16 ***
## month(10)oct                 -0.4169102  0.1083661  -3.847 0.000119 ***
## month(11)nov                 -1.2682649  0.0864823 -14.665  < 2e-16 ***
## month(12)dec                 -0.6680157  0.1560079  -4.282 1.85e-05 ***
## month(04)apr                 -0.8896348  0.0825367 -10.779  < 2e-16 ***
## month(09)sep                 -1.4185115  0.1100721 -12.887  < 2e-16 ***
## day_of_week(02)tue            0.1178911  0.0321954   3.662 0.000251 ***
## day_of_week(03)wed            0.2793659  0.0316418   8.829  < 2e-16 ***
## day_of_week(04)thu            0.1862294  0.0311848   5.972 2.35e-09 ***
## day_of_week(05)fri            0.1848426  0.0321153   5.756 8.63e-09 ***
## campaign2                    -0.0137088  0.0247116  -0.555 0.579063    
## campaign3                     0.0487829  0.0312224   1.562 0.118187    
## campaign4                     0.0283624  0.0418558   0.678 0.498012    
## campaign5                    -0.1708053  0.0545032  -3.134 0.001725 ** 
## campaign6                    -0.0725155  0.0685905  -1.057 0.290410    
## campaign7                    -0.3903269  0.0921664  -4.235 2.29e-05 ***
## campaign8                    -0.6644914  0.1255451  -5.293 1.20e-07 ***
## campaign9                    -0.2045572  0.1409559  -1.451 0.146720    
## campaign10                   -0.2952504  0.1536090  -1.922 0.054594 .  
## previous1                     0.3160783  0.1674063   1.888 0.059014 .  
## previous2+                    0.0946721  0.1979125   0.478 0.632399    
## poutcomefailure              -0.8090101  0.1660720  -4.871 1.11e-06 ***
## poutcomesuccess                      NA         NA      NA       NA    
## cons.conf.idx                 0.0191473  0.0031135   6.150 7.76e-10 ***
## nr.employed                  -0.0110631  0.0002049 -53.986  < 2e-16 ***
## pdays_dummy1                  1.1752476  0.1650095   7.122 1.06e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78406  on 56557  degrees of freedom
## Residual deviance: 60008  on 56509  degrees of freedom
## AIC: 60106
## 
## Number of Fisher Scoring iterations: 5

2.4.2.2 Feature importance

2.4.2.3 Predicted scores

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading

2.4.2.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24143 10279
##          1  4136 18000
##                                           
##                Accuracy : 0.7451          
##                  95% CI : (0.7415, 0.7487)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4903          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6365          
##             Specificity : 0.8537          
##          Pos Pred Value : 0.8132          
##          Neg Pred Value : 0.7014          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3183          
##    Detection Prevalence : 0.3914          
##       Balanced Accuracy : 0.7451          
##                                           
##        'Positive' Class : 1               
## 

2.4.2.6 Validation (cut cost and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6028  374
##          1 1041  538
##                                          
##                Accuracy : 0.8227         
##                  95% CI : (0.8141, 0.831)
##     No Information Rate : 0.8857         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3357         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.58991        
##             Specificity : 0.85274        
##          Pos Pred Value : 0.34072        
##          Neg Pred Value : 0.94158        
##              Prevalence : 0.11427        
##          Detection Rate : 0.06741        
##    Detection Prevalence : 0.19784        
##       Balanced Accuracy : 0.72132        
##                                          
##        'Positive' Class : 1              
## 

2.4.3 Decision Tree

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

2.4.3.1 Feature importance

2.4.3.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24286  9076
##          1  3993 19203
##                                           
##                Accuracy : 0.7689          
##                  95% CI : (0.7654, 0.7724)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5379          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6791          
##             Specificity : 0.8588          
##          Pos Pred Value : 0.8279          
##          Neg Pred Value : 0.7280          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3395          
##    Detection Prevalence : 0.4101          
##       Balanced Accuracy : 0.7689          
##                                           
##        'Positive' Class : 1               
## 

2.4.3.5 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6045  364
##          1 1024  548
##                                           
##                Accuracy : 0.8261          
##                  95% CI : (0.8176, 0.8343)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3467          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.60088         
##             Specificity : 0.85514         
##          Pos Pred Value : 0.34860         
##          Neg Pred Value : 0.94320         
##              Prevalence : 0.11427         
##          Detection Rate : 0.06866         
##    Detection Prevalence : 0.19697         
##       Balanced Accuracy : 0.72801         
##                                           
##        'Positive' Class : 1               
## 
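
The validation subsections apply a cutoff to the predicted probabilities before building the confusion matrix. A minimal sketch of how such a cutoff can be chosen to minimize a business cost, on toy data and with assumed unit costs (a missed subscriber costs 5, a wasted call costs 1 — illustrative values, not the report's):

```r
cost_fn <- 5; cost_fp <- 1                       # assumed unit costs
probs  <- c(0.10, 0.35, 0.45, 0.62, 0.80, 0.90)  # toy predicted P(y = 1)
actual <- c(0, 0, 1, 0, 1, 1)                    # toy true labels
cuts <- seq(0.05, 0.95, by = 0.05)
total_cost <- sapply(cuts, function(ct) {
  pred <- as.integer(probs >= ct)
  sum(pred == 0 & actual == 1) * cost_fn +       # false negatives
    sum(pred == 1 & actual == 0) * cost_fp       # false positives
})
best_cut <- cuts[which.min(total_cost)]          # first cutoff reaching minimum cost
```

In this toy example the minimum cost of 1 is first reached at a cutoff of 0.40; the cutoff would then be fixed on training data and applied unchanged to validation.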

2.4.4 Decision Tree 2

2.4.4.1 Feature importance

2.4.4.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24394  9951
##          1  3885 18328
##                                           
##                Accuracy : 0.7554          
##                  95% CI : (0.7518, 0.7589)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5107          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6481          
##             Specificity : 0.8626          
##          Pos Pred Value : 0.8251          
##          Neg Pred Value : 0.7103          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3241          
##    Detection Prevalence : 0.3927          
##       Balanced Accuracy : 0.7554          
##                                           
##        'Positive' Class : 1               
## 

2.4.4.5 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6109  368
##          1  960  544
##                                           
##                Accuracy : 0.8336          
##                  95% CI : (0.8252, 0.8417)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3592          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59649         
##             Specificity : 0.86420         
##          Pos Pred Value : 0.36170         
##          Neg Pred Value : 0.94318         
##              Prevalence : 0.11427         
##          Detection Rate : 0.06816         
##    Detection Prevalence : 0.18845         
##       Balanced Accuracy : 0.73034         
##                                           
##        'Positive' Class : 1               
## 

2.4.5 Random Forest (ranger)

2.4.5.1 Results

## Ranger result
## 
## Call:
##  ranger(y ~ ., data = bank_train_us, num.trees = 1000, importance = "impurity",      write.forest = T, probability = T) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      56558 
## Number of independent variables:  15 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1206238
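
For a probability forest, the OOB prediction error reported by ranger is a Brier score: the mean squared difference between predicted class probabilities and the true 0/1 outcomes. A toy recomputation of the idea:

```r
probs  <- c(0.9, 0.2, 0.7, 0.4)    # out-of-bag predicted P(y = 1)
actual <- c(1, 0, 1, 1)            # true labels
brier  <- mean((probs - actual)^2) # 0 is perfect; 0.25 matches always guessing 0.5
```

Lower is better, so the 0.121 above indicates reasonably calibrated OOB probabilities.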

2.4.5.2 Feature importance

2.4.5.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 23823  3063
##          1  4456 25216
##                                           
##                Accuracy : 0.8671          
##                  95% CI : (0.8642, 0.8698)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7341          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8917          
##             Specificity : 0.8424          
##          Pos Pred Value : 0.8498          
##          Neg Pred Value : 0.8861          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4458          
##    Detection Prevalence : 0.5246          
##       Balanced Accuracy : 0.8671          
##                                           
##        'Positive' Class : 1               
## 

2.4.5.6 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6437  424
##          1  632  488
##                                          
##                Accuracy : 0.8677         
##                  95% CI : (0.8601, 0.875)
##     No Information Rate : 0.8857         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.4054         
##                                          
##  Mcnemar's Test P-Value : 1.89e-10       
##                                          
##             Sensitivity : 0.53509        
##             Specificity : 0.91060        
##          Pos Pred Value : 0.43571        
##          Neg Pred Value : 0.93820        
##              Prevalence : 0.11427        
##          Detection Rate : 0.06115        
##    Detection Prevalence : 0.14033        
##       Balanced Accuracy : 0.72284        
##                                          
##        'Positive' Class : 1              
## 
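
The Kappa of 0.4054 in the random-forest validation output above measures observed agreement relative to the agreement expected by chance given the row and column totals, and can be recomputed directly:

```r
# Counts from the validation confusion matrix above (rows = prediction, cols = reference)
m  <- matrix(c(6437, 632, 424, 488), nrow = 2)
n  <- sum(m)
po <- sum(diag(m)) / n                     # observed accuracy
pe <- sum(rowSums(m) * colSums(m)) / n^2   # chance agreement
kappa <- (po - pe) / (1 - pe)              # ~0.4054, as reported
```

Because the chance agreement is already 0.777 at this class imbalance, Kappa is a more demanding summary than raw accuracy here.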

2.4.6 Model selection

2.5 Case 4 - SMOTE sampling
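
SMOTE balances the classes by generating synthetic minority examples rather than duplicating existing ones: each synthetic point is an interpolation between a minority sample and one of its minority-class nearest neighbours. A minimal sketch of the mechanics on toy data (the report presumably used a package implementation such as DMwR or smotefamily; this is only the core idea, using the single nearest neighbour):

```r
set.seed(42)
minority <- matrix(rnorm(10 * 2, mean = 2), ncol = 2)  # toy minority class, 10 points in 2-D
synth <- t(apply(minority, 1, function(x) {
  d  <- sqrt(colSums((t(minority) - x)^2))   # distances to all minority points
  nn <- minority[order(d)[2], ]              # nearest neighbour (index 1 is the point itself)
  x + runif(1) * (nn - x)                    # random point on the connecting segment
}))
```

The synthetic points lie strictly inside the minority region, which tends to smooth the decision boundary compared with plain oversampling.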

2.5.1 Logistic regression

2.5.1.1 Results

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_smote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9080  -0.5557  -0.4229   0.3906   2.8333  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  93.2488632 19.7373337   4.724 2.31e-06 ***
## agemid                       -0.5344829  0.0452568 -11.810  < 2e-16 ***
## agehigh                       0.3987200  0.1090570   3.656 0.000256 ***
## jobhousemaid                 -0.3802525  0.1612967  -2.357 0.018400 *  
## jobservices                  -0.4159130  0.1229044  -3.384 0.000714 ***
## jobadmin.                    -0.3746699  0.1118170  -3.351 0.000806 ***
## jobblue-collar               -0.4558465  0.1157280  -3.939 8.18e-05 ***
## jobtechnician                -0.2632381  0.1154590  -2.280 0.022612 *  
## jobretired                   -0.0589143  0.1344556  -0.438 0.661263    
## jobmanagement                -0.3870398  0.1285658  -3.010 0.002609 ** 
## jobself-employed             -0.4034715  0.1504745  -2.681 0.007333 ** 
## jobentrepreneur              -0.2579890  0.1457006  -1.771 0.076614 .  
## jobstudent                    0.1946682  0.1393482   1.397 0.162416    
## maritalmarried               -0.1742006  0.0427968  -4.070 4.69e-05 ***
## maritaldivorced               0.0410038  0.0646467   0.634 0.525901    
## educationhigh.school          0.0705941  0.0764061   0.924 0.355521    
## educationbasic.6y            -0.0047153  0.1060676  -0.044 0.964541    
## educationbasic.9y            -0.0478872  0.0799833  -0.599 0.549363    
## educationprofessional.course  0.0002181  0.0853987   0.003 0.997962    
## educationuniversity.degree    0.0253369  0.0751128   0.337 0.735877    
## contactcellular              -0.2847054  0.0502567  -5.665 1.47e-08 ***
## month(05)may                 -1.6654771  0.1285150 -12.959  < 2e-16 ***
## month(06)jun                 -0.6333468  0.1483437  -4.269 1.96e-05 ***
## month(07)jul                 -0.5554402  0.1381080  -4.022 5.78e-05 ***
## month(08)aug                 -0.7539648  0.1426203  -5.287 1.25e-07 ***
## month(10)oct                 -0.6250673  0.1673719  -3.735 0.000188 ***
## month(11)nov                 -1.0770170  0.1382262  -7.792 6.61e-15 ***
## month(12)dec                 -0.6907535  0.2400215  -2.878 0.004004 ** 
## month(04)apr                 -0.8380774  0.1359455  -6.165 7.06e-10 ***
## month(09)sep                 -1.2121525  0.1672006  -7.250 4.18e-13 ***
## day_of_week(02)tue            0.2227235  0.0584656   3.809 0.000139 ***
## day_of_week(03)wed            0.1938470  0.0584308   3.318 0.000908 ***
## day_of_week(04)thu            0.2133142  0.0568266   3.754 0.000174 ***
## day_of_week(05)fri            0.1727889  0.0578978   2.984 0.002842 ** 
## campaign2                     0.1206141  0.0445051   2.710 0.006726 ** 
## campaign3                     0.2059940  0.0569927   3.614 0.000301 ***
## campaign4                     0.0950811  0.0765380   1.242 0.214136    
## campaign5                    -0.1377385  0.1024735  -1.344 0.178904    
## campaign6                    -0.1368122  0.1243314  -1.100 0.271165    
## campaign7                    -0.1579635  0.1740050  -0.908 0.363979    
## campaign8                    -0.6284708  0.2375737  -2.645 0.008160 ** 
## campaign9                    -0.2192995  0.3040784  -0.721 0.470790    
## campaign10                   -0.2868000  0.2896938  -0.990 0.322169    
## previous1                     0.9482097  0.0735581  12.891  < 2e-16 ***
## previous2+                    1.2651863  0.1005609  12.581  < 2e-16 ***
## poutcomefailure              -1.1055365  0.0794350 -13.918  < 2e-16 ***
## poutcomesuccess               1.7184241  0.0872614  19.693  < 2e-16 ***
## cons.price.idx               -0.2107228  0.1101344  -1.913 0.055707 .  
## cons.conf.idx                -0.0095942  0.0075382  -1.273 0.203111    
## euribor3m                     0.0892138  0.1068263   0.835 0.403645    
## nr.employed                  -0.0142885  0.0020582  -6.942 3.86e-12 ***
## pdays_dummy1                  1.8923284  0.0795544  23.787  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34916  on 25563  degrees of freedom
## Residual deviance: 19434  on 25512  degrees of freedom
## AIC: 19538
## 
## Number of Fisher Scoring iterations: 6
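
The glm coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are usually easier to interpret. A self-contained toy fit to illustrate (names and true coefficients here are illustrative, not from the report's data):

```r
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-0.5 + 1.2 * x))   # simulate a binary outcome
fit <- glm(y ~ x, family = "binomial")
odds_ratios <- exp(coef(fit))                 # > 1 means higher odds of y = 1
```

Applied to the model above, for example, `exp(1.7184)` for `poutcomesuccess` says that a successful previous campaign multiplies the odds of subscription by roughly 5.6, other variables held fixed.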

2.5.1.2 Feature importance

2.5.1.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13438  2482
##          1  1170  8474
##                                           
##                Accuracy : 0.8571          
##                  95% CI : (0.8528, 0.8614)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7039          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7735          
##             Specificity : 0.9199          
##          Pos Pred Value : 0.8787          
##          Neg Pred Value : 0.8441          
##              Prevalence : 0.4286          
##          Detection Rate : 0.3315          
##    Detection Prevalence : 0.3772          
##       Balanced Accuracy : 0.8467          
##                                           
##        'Positive' Class : 1               
## 

2.5.1.6 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6473  465
##          1  596  447
##                                           
##                Accuracy : 0.8671          
##                  95% CI : (0.8594, 0.8744)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3819          
##                                           
##  Mcnemar's Test P-Value : 6.579e-05       
##                                           
##             Sensitivity : 0.49013         
##             Specificity : 0.91569         
##          Pos Pred Value : 0.42857         
##          Neg Pred Value : 0.93298         
##              Prevalence : 0.11427         
##          Detection Rate : 0.05601         
##    Detection Prevalence : 0.13069         
##       Balanced Accuracy : 0.70291         
##                                           
##        'Positive' Class : 1               
## 

2.5.2 Logistic regression 2 (simple)

2.5.2.1 Results

## 
## Call:
## glm(formula = y ~ . - education - euribor3m - cons.conf.idx, 
##     family = "binomial", data = bank_train_smote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8855  -0.5571  -0.4240   0.3897   2.8093  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        76.7013362  3.7012677  20.723  < 2e-16 ***
## agemid             -0.5359537  0.0451457 -11.872  < 2e-16 ***
## agehigh             0.3883324  0.1082084   3.589 0.000332 ***
## jobhousemaid       -0.3839028  0.1600979  -2.398 0.016488 *  
## jobservices        -0.3881456  0.1216401  -3.191 0.001418 ** 
## jobadmin.          -0.3583942  0.1110376  -3.228 0.001248 ** 
## jobblue-collar     -0.4784466  0.1136135  -4.211 2.54e-05 ***
## jobtechnician      -0.2637394  0.1139216  -2.315 0.020608 *  
## jobretired         -0.0614351  0.1341764  -0.458 0.647047    
## jobmanagement      -0.3769169  0.1271588  -2.964 0.003035 ** 
## jobself-employed   -0.4050365  0.1498441  -2.703 0.006871 ** 
## jobentrepreneur    -0.2508600  0.1452633  -1.727 0.084180 .  
## jobstudent          0.2024349  0.1393701   1.452 0.146363    
## maritalmarried     -0.1802077  0.0425488  -4.235 2.28e-05 ***
## maritaldivorced     0.0388334  0.0645479   0.602 0.547426    
## contactcellular    -0.2659993  0.0475085  -5.599 2.16e-08 ***
## month(05)may       -1.6948951  0.1262887 -13.421  < 2e-16 ***
## month(06)jun       -0.6983883  0.1344518  -5.194 2.05e-07 ***
## month(07)jul       -0.5979308  0.1341108  -4.458 8.25e-06 ***
## month(08)aug       -0.8212427  0.1334215  -6.155 7.50e-10 ***
## month(10)oct       -0.6525206  0.1626887  -4.011 6.05e-05 ***
## month(11)nov       -1.0887553  0.1367547  -7.961 1.70e-15 ***
## month(12)dec       -0.7289005  0.2372191  -3.073 0.002121 ** 
## month(04)apr       -0.8622479  0.1314250  -6.561 5.35e-11 ***
## month(09)sep       -1.2435974  0.1625352  -7.651 1.99e-14 ***
## day_of_week(02)tue  0.2216926  0.0583972   3.796 0.000147 ***
## day_of_week(03)wed  0.1921815  0.0583995   3.291 0.000999 ***
## day_of_week(04)thu  0.2099647  0.0567778   3.698 0.000217 ***
## day_of_week(05)fri  0.1721642  0.0578598   2.976 0.002925 ** 
## campaign2           0.1188650  0.0444799   2.672 0.007533 ** 
## campaign3           0.2065160  0.0569361   3.627 0.000287 ***
## campaign4           0.0917713  0.0765087   1.199 0.230338    
## campaign5          -0.1348198  0.1023636  -1.317 0.187816    
## campaign6          -0.1380927  0.1243314  -1.111 0.266705    
## campaign7          -0.1608393  0.1737981  -0.925 0.354738    
## campaign8          -0.6227392  0.2373027  -2.624 0.008684 ** 
## campaign9          -0.2195498  0.3041841  -0.722 0.470438    
## campaign10         -0.2740653  0.2883995  -0.950 0.341961    
## previous1           0.9494339  0.0735275  12.913  < 2e-16 ***
## previous2+          1.2650879  0.1005895  12.577  < 2e-16 ***
## poutcomefailure    -1.1068570  0.0794128 -13.938  < 2e-16 ***
## poutcomesuccess     1.7186775  0.0872048  19.709  < 2e-16 ***
## cons.price.idx     -0.1210621  0.0415241  -2.915 0.003552 ** 
## nr.employed        -0.0125620  0.0003288 -38.204  < 2e-16 ***
## pdays_dummy1        1.8921971  0.0795255  23.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34916  on 25563  degrees of freedom
## Residual deviance: 19439  on 25519  degrees of freedom
## AIC: 19529
## 
## Number of Fisher Scoring iterations: 6

2.5.2.2 Feature importance

2.5.2.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13442  2482
##          1  1166  8474
##                                          
##                Accuracy : 0.8573         
##                  95% CI : (0.853, 0.8616)
##     No Information Rate : 0.5714         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7042         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7735         
##             Specificity : 0.9202         
##          Pos Pred Value : 0.8790         
##          Neg Pred Value : 0.8441         
##              Prevalence : 0.4286         
##          Detection Rate : 0.3315         
##    Detection Prevalence : 0.3771         
##       Balanced Accuracy : 0.8468         
##                                          
##        'Positive' Class : 1              
## 

2.5.2.6 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6475  462
##          1  594  450
##                                          
##                Accuracy : 0.8677         
##                  95% CI : (0.8601, 0.875)
##     No Information Rate : 0.8857         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3851         
##                                          
##  Mcnemar's Test P-Value : 5.548e-05      
##                                          
##             Sensitivity : 0.49342        
##             Specificity : 0.91597        
##          Pos Pred Value : 0.43103        
##          Neg Pred Value : 0.93340        
##              Prevalence : 0.11427        
##          Detection Rate : 0.05638        
##    Detection Prevalence : 0.13081        
##       Balanced Accuracy : 0.70470        
##                                          
##        'Positive' Class : 1              
## 

2.5.3 Decision Tree

2.5.3.1 Feature importance

2.5.3.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13697  1827
##          1   911  9129
##                                          
##                Accuracy : 0.8929         
##                  95% CI : (0.889, 0.8967)
##     No Information Rate : 0.5714         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.779          
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8332         
##             Specificity : 0.9376         
##          Pos Pred Value : 0.9093         
##          Neg Pred Value : 0.8823         
##              Prevalence : 0.4286         
##          Detection Rate : 0.3571         
##    Detection Prevalence : 0.3927         
##       Balanced Accuracy : 0.8854         
##                                          
##        'Positive' Class : 1              
## 

2.5.3.5 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6575  491
##          1  494  421
##                                           
##                Accuracy : 0.8766          
##                  95% CI : (0.8692, 0.8837)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 0.9948          
##                                           
##                   Kappa : 0.3912          
##                                           
##  Mcnemar's Test P-Value : 0.9492          
##                                           
##             Sensitivity : 0.46162         
##             Specificity : 0.93012         
##          Pos Pred Value : 0.46011         
##          Neg Pred Value : 0.93051         
##              Prevalence : 0.11427         
##          Detection Rate : 0.05275         
##    Detection Prevalence : 0.11465         
##       Balanced Accuracy : 0.69587         
##                                           
##        'Positive' Class : 1               
## 

2.5.4 Decision Tree 2

2.5.4.1 Feature importance

2.5.4.4 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13471  1770
##          1  1137  9186
##                                           
##                Accuracy : 0.8863          
##                  95% CI : (0.8823, 0.8902)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7661          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8384          
##             Specificity : 0.9222          
##          Pos Pred Value : 0.8899          
##          Neg Pred Value : 0.8839          
##              Prevalence : 0.4286          
##          Detection Rate : 0.3593          
##    Detection Prevalence : 0.4038          
##       Balanced Accuracy : 0.8803          
##                                           
##        'Positive' Class : 1               
## 

2.5.4.5 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6513  458
##          1  556  454
##                                           
##                Accuracy : 0.8729          
##                  95% CI : (0.8654, 0.8802)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 0.999811        
##                                           
##                   Kappa : 0.4004          
##                                           
##  Mcnemar's Test P-Value : 0.002318        
##                                           
##             Sensitivity : 0.49781         
##             Specificity : 0.92135         
##          Pos Pred Value : 0.44950         
##          Neg Pred Value : 0.93430         
##              Prevalence : 0.11427         
##          Detection Rate : 0.05689         
##    Detection Prevalence : 0.12655         
##       Balanced Accuracy : 0.70958         
##                                           
##        'Positive' Class : 1               
## 

2.5.5 Random Forest (ranger)

2.5.5.1 Results

## Ranger result
## 
## Call:
##  ranger(y ~ ., data = bank_train_smote, num.trees = 1000, importance = "impurity",      write.forest = T, probability = T) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      25564 
## Number of independent variables:  15 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.07880187

2.5.5.2 Feature importance

2.5.5.5 Confusion matrix - train data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13810  1383
##          1   798  9573
##                                           
##                Accuracy : 0.9147          
##                  95% CI : (0.9112, 0.9181)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8246          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8738          
##             Specificity : 0.9454          
##          Pos Pred Value : 0.9231          
##          Neg Pred Value : 0.9090          
##              Prevalence : 0.4286          
##          Detection Rate : 0.3745          
##    Detection Prevalence : 0.4057          
##       Balanced Accuracy : 0.9096          
##                                           
##        'Positive' Class : 1               
## 

2.5.5.6 Validation (cost-based cutoff and confusion matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6579  483
##          1  490  429
##                                           
##                Accuracy : 0.8781          
##                  95% CI : (0.8707, 0.8852)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 0.9841          
##                                           
##                   Kappa : 0.3997          
##                                           
##  Mcnemar's Test P-Value : 0.8475          
##                                           
##             Sensitivity : 0.47039         
##             Specificity : 0.93068         
##          Pos Pred Value : 0.46681         
##          Neg Pred Value : 0.93161         
##              Prevalence : 0.11427         
##          Detection Rate : 0.05375         
##    Detection Prevalence : 0.11515         
##       Balanced Accuracy : 0.70054         
##                                           
##        'Positive' Class : 1               
## 

2.5.6 Model selection

2.5.6.2 Prediction quality - train

##                           Model       AUC Threshold  Accuracy Sensitivity
## 1 Logistic regression (complex) 0.9008535       0.5 0.8571429   0.7734575
## 2  Logistic regression (simple) 0.9007183       0.5 0.8572993   0.7734575
## 3       Decision tree (complex) 0.9201630       0.5 0.8928963   0.8332421
## 4        Decision tree (simple) 0.9077987       0.5 0.8862854   0.8384447
## 5        Random forest (ranger) 0.9721808       0.5 0.9146847   0.8737678

2.5.6.4 Prediction quality - validation

##                           Model       AUC Threshold  Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7701089       0.5 0.8670593   0.4901316
## 2  Logistic regression (simple) 0.7697226       0.5 0.8676858   0.4934211
## 3       Decision tree (complex) 0.7607222       0.5 0.8765819   0.4616228
## 4        Decision tree (simple) 0.7628825       0.5 0.8729483   0.4978070
## 5        Random forest (ranger) 0.7851047       0.5 0.8780855   0.4703947
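
The AUC column in these comparison tables can be reproduced from predicted probabilities via the Mann-Whitney identity: AUC is the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counting one half. A toy check (the report itself presumably used a ROC package):

```r
probs  <- c(0.10, 0.40, 0.35, 0.80)   # toy predicted probabilities
actual <- c(0, 0, 1, 1)               # toy true labels
pos <- probs[actual == 1]
neg <- probs[actual == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))  # 0.75
```

Unlike accuracy and sensitivity, AUC is independent of the cutoff, which is why it is the natural column to compare models on before any cost-based threshold is applied.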

2.6 Model selection - conclusion

##                           Model       AUC Threshold  Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7749897     0.200 0.8634256   0.5394737
## 2  Logistic regression (simple) 0.7756386     0.200 0.8621727   0.5427632
## 3       Decision tree (complex) 0.6926397     0.250 0.8784613   0.4298246
## 4        Decision tree (simple) 0.7396911     0.300 0.8710688   0.5098684
## 5        Random forest (ranger) 0.7861381     0.125 0.8660569   0.5592105
##                           Model       AUC Threshold  Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7776273       0.5 0.8130560   0.5997807
## 2  Logistic regression (simple) 0.7786925       0.5 0.8104248   0.6063596
## 3       Decision tree (complex) 0.7616706       0.5 0.8178173   0.6162281
## 4        Decision tree (simple) 0.7515769       0.5 0.8085453   0.6315789
## 5        Random forest (ranger) 0.7881108       0.5 0.8209498   0.6304825
##                           Model       AUC Threshold  Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7767975       0.5 0.8215762   0.5932018
## 2  Logistic regression (simple) 0.7767675       0.5 0.8227039   0.5899123
## 3       Decision tree (complex) 0.7566659       0.5 0.8260870   0.6008772
## 4        Decision tree (simple) 0.7519985       0.5 0.8336048   0.5964912
## 5        Random forest (ranger) 0.7809281       0.5 0.8676858   0.5350877
##                           Model       AUC Threshold  Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7701089       0.5 0.8670593   0.4901316
## 2  Logistic regression (simple) 0.7697226       0.5 0.8676858   0.4934211
## 3       Decision tree (complex) 0.7607222       0.5 0.8765819   0.4616228
## 4        Decision tree (simple) 0.7628825       0.5 0.8729483   0.4978070
## 5        Random forest (ranger) 0.7851047       0.5 0.8780855   0.4703947
##    sampling                         Model       AUC Threshold  Accuracy Sensitivity
## 1        ns Logistic regression (complex) 0.7749897     0.200 0.8634256   0.5394737
## 2        ns  Logistic regression (simple) 0.7756386     0.200 0.8621727   0.5427632
## 3        ns       Decision tree (complex) 0.6926397     0.250 0.8784613   0.4298246
## 4        ns        Decision tree (simple) 0.7396911     0.300 0.8710688   0.5098684
## 5        ns        Random forest (ranger) 0.7861381     0.125 0.8660569   0.5592105
## 6        ds Logistic regression (complex) 0.7776273     0.500 0.8130560   0.5997807
## 7        ds  Logistic regression (simple) 0.7786925     0.500 0.8104248   0.6063596
## 8        ds       Decision tree (complex) 0.7616706     0.500 0.8178173   0.6162281
## 9        ds        Decision tree (simple) 0.7515769     0.500 0.8085453   0.6315789
## 10       ds        Random forest (ranger) 0.7881108     0.500 0.8209498   0.6304825
## 11       us Logistic regression (complex) 0.7767975     0.500 0.8215762   0.5932018
## 12       us  Logistic regression (simple) 0.7767675     0.500 0.8227039   0.5899123
## 13       us       Decision tree (complex) 0.7566659     0.500 0.8260870   0.6008772
## 14       us        Decision tree (simple) 0.7519985     0.500 0.8336048   0.5964912
## 15       us        Random forest (ranger) 0.7809281     0.500 0.8676858   0.5350877
## 16    smote Logistic regression (complex) 0.7701089     0.500 0.8670593   0.4901316
## 17    smote  Logistic regression (simple) 0.7697226     0.500 0.8676858   0.4934211
## 18    smote       Decision tree (complex) 0.7607222     0.500 0.8765819   0.4616228
## 19    smote        Decision tree (simple) 0.7628825     0.500 0.8729483   0.4978070
## 20    smote        Random forest (ranger) 0.7851047     0.500 0.8780855   0.4703947
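Note that the no-sampling (ns) rows above use tuned cutoffs well below the default 0.5 (0.125–0.300): without rebalancing, predicted probabilities for the rare positive class stay low, so a lower threshold is needed to recover sensitivity. The mechanics can be sketched with toy scores (in Python rather than the R used for this report; the probabilities and labels here are illustrative only):

```python
import numpy as np

def accuracy_sensitivity(probs, y_true, threshold):
    """Classify as positive when predicted probability >= threshold,
    then report overall accuracy and sensitivity (true-positive rate)."""
    pred = (probs >= threshold).astype(int)
    accuracy = (pred == y_true).mean()
    sensitivity = pred[y_true == 1].mean()  # fraction of true positives caught
    return accuracy, sensitivity

# Toy scores for an imbalanced problem: positives get low probabilities
probs = np.array([0.05, 0.10, 0.15, 0.30, 0.60, 0.20, 0.08, 0.40])
y     = np.array([0,    0,    1,    1,    1,    0,    0,    0])

print(accuracy_sensitivity(probs, y, 0.5))    # strict default cutoff
print(accuracy_sensitivity(probs, y, 0.125))  # relaxed cutoff, higher sensitivity
```

On these toy values the relaxed cutoff keeps the same accuracy while catching all positives, which mirrors why the ns models are reported at low thresholds.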
##    sampling                         Model       AUC Threshold  Accuracy Sensitivity
## 1        ns       Decision tree (complex) 0.6926397     0.250 0.8784613   0.4298246
## 2     smote        Random forest (ranger) 0.7851047     0.500 0.8780855   0.4703947
## 3     smote       Decision tree (complex) 0.7607222     0.500 0.8765819   0.4616228
## 4     smote        Decision tree (simple) 0.7628825     0.500 0.8729483   0.4978070
## 5        ns        Decision tree (simple) 0.7396911     0.300 0.8710688   0.5098684
## 6        us        Random forest (ranger) 0.7809281     0.500 0.8676858   0.5350877
## 7     smote  Logistic regression (simple) 0.7697226     0.500 0.8676858   0.4934211
## 8     smote Logistic regression (complex) 0.7701089     0.500 0.8670593   0.4901316
## 9        ns        Random forest (ranger) 0.7861381     0.125 0.8660569   0.5592105
## 10       ns Logistic regression (complex) 0.7749897     0.200 0.8634256   0.5394737
## 11       ns  Logistic regression (simple) 0.7756386     0.200 0.8621727   0.5427632
## 12       us        Decision tree (simple) 0.7519985     0.500 0.8336048   0.5964912
## 13       us       Decision tree (complex) 0.7566659     0.500 0.8260870   0.6008772
## 14       us  Logistic regression (simple) 0.7767675     0.500 0.8227039   0.5899123
## 15       us Logistic regression (complex) 0.7767975     0.500 0.8215762   0.5932018
## 16       ds        Random forest (ranger) 0.7881108     0.500 0.8209498   0.6304825
## 17       ds       Decision tree (complex) 0.7616706     0.500 0.8178173   0.6162281
## 18       ds Logistic regression (complex) 0.7776273     0.500 0.8130560   0.5997807
## 19       ds  Logistic regression (simple) 0.7786925     0.500 0.8104248   0.6063596
## 20       ds        Decision tree (simple) 0.7515769     0.500 0.8085453   0.6315789
##    sampling                         Model       AUC Threshold  Accuracy Sensitivity
## 1        ds        Decision tree (simple) 0.7515769     0.500 0.8085453   0.6315789
## 2        ds        Random forest (ranger) 0.7881108     0.500 0.8209498   0.6304825
## 3        ds       Decision tree (complex) 0.7616706     0.500 0.8178173   0.6162281
## 4        ds  Logistic regression (simple) 0.7786925     0.500 0.8104248   0.6063596
## 5        us       Decision tree (complex) 0.7566659     0.500 0.8260870   0.6008772
## 6        ds Logistic regression (complex) 0.7776273     0.500 0.8130560   0.5997807
## 7        us        Decision tree (simple) 0.7519985     0.500 0.8336048   0.5964912
## 8        us Logistic regression (complex) 0.7767975     0.500 0.8215762   0.5932018
## 9        us  Logistic regression (simple) 0.7767675     0.500 0.8227039   0.5899123
## 10       ns        Random forest (ranger) 0.7861381     0.125 0.8660569   0.5592105
## 11       ns  Logistic regression (simple) 0.7756386     0.200 0.8621727   0.5427632
## 12       ns Logistic regression (complex) 0.7749897     0.200 0.8634256   0.5394737
## 13       us        Random forest (ranger) 0.7809281     0.500 0.8676858   0.5350877
## 14       ns        Decision tree (simple) 0.7396911     0.300 0.8710688   0.5098684
## 15    smote        Decision tree (simple) 0.7628825     0.500 0.8729483   0.4978070
## 16    smote  Logistic regression (simple) 0.7697226     0.500 0.8676858   0.4934211
## 17    smote Logistic regression (complex) 0.7701089     0.500 0.8670593   0.4901316
## 18    smote        Random forest (ranger) 0.7851047     0.500 0.8780855   0.4703947
## 19    smote       Decision tree (complex) 0.7607222     0.500 0.8765819   0.4616228
## 20       ns       Decision tree (complex) 0.6926397     0.250 0.8784613   0.4298246
##    sampling                         Model       AUC Threshold  Accuracy Sensitivity     Perf
## 1        ds        Random forest (ranger) 0.7881108     0.500 0.8209498   0.6304825 2.081915
## 2        ds        Decision tree (simple) 0.7515769     0.500 0.8085453   0.6315789 2.071703
## 3        ds       Decision tree (complex) 0.7616706     0.500 0.8178173   0.6162281 2.050273
## 4        us       Decision tree (complex) 0.7566659     0.500 0.8260870   0.6008772 2.027841
## 5        us        Decision tree (simple) 0.7519985     0.500 0.8336048   0.5964912 2.026587
## 6        ds  Logistic regression (simple) 0.7786925     0.500 0.8104248   0.6063596 2.023144
## 7        ds Logistic regression (complex) 0.7776273     0.500 0.8130560   0.5997807 2.012617
## 8        us Logistic regression (complex) 0.7767975     0.500 0.8215762   0.5932018 2.007980
## 9        us  Logistic regression (simple) 0.7767675     0.500 0.8227039   0.5899123 2.002528
## 10       ns        Random forest (ranger) 0.7861381     0.125 0.8660569   0.5592105 1.984478
## 11       ns  Logistic regression (simple) 0.7756386     0.200 0.8621727   0.5427632 1.947699
## 12       ns Logistic regression (complex) 0.7749897     0.200 0.8634256   0.5394737 1.942373
## 13       us        Random forest (ranger) 0.7809281     0.500 0.8676858   0.5350877 1.937861
## 14       ns        Decision tree (simple) 0.7396911     0.300 0.8710688   0.5098684 1.890806
## 15    smote        Decision tree (simple) 0.7628825     0.500 0.8729483   0.4978070 1.868562
## 16    smote  Logistic regression (simple) 0.7697226     0.500 0.8676858   0.4934211 1.854528
## 17    smote Logistic regression (complex) 0.7701089     0.500 0.8670593   0.4901316 1.847322
## 18    smote        Random forest (ranger) 0.7851047     0.500 0.8780855   0.4703947 1.818875
## 19    smote       Decision tree (complex) 0.7607222     0.500 0.8765819   0.4616228 1.799827
## 20       ns       Decision tree (complex) 0.6926397     0.250 0.8784613   0.4298246 1.738110
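The listings above are the same 20 model/sampling combinations re-ranked by Accuracy, Sensitivity, and the composite Perf score (whose exact formula is not shown in this output, so it is omitted below). The re-rankings amount to simple data-frame sorts; as a sketch in pandas (the report itself uses R) on a few rows transcribed from the table:

```python
import pandas as pd

# A small subset of rows transcribed from the results table above
results = pd.DataFrame({
    "sampling": ["ds", "ds", "ns", "smote"],
    "Model": ["Random forest (ranger)", "Decision tree (simple)",
              "Decision tree (complex)", "Random forest (ranger)"],
    "AUC": [0.7881108, 0.7515769, 0.6926397, 0.7851047],
    "Accuracy": [0.8209498, 0.8085453, 0.8784613, 0.8780855],
    "Sensitivity": [0.6304825, 0.6315789, 0.4298246, 0.4703947],
})

# Rank by Accuracy (descending), as in the second listing
by_accuracy = results.sort_values("Accuracy", ascending=False)

# Rank by Sensitivity (descending), as in the third listing
by_sensitivity = results.sort_values("Sensitivity", ascending=False)

print(by_accuracy[["sampling", "Model", "Accuracy"]].to_string(index=False))
print(by_sensitivity[["sampling", "Model", "Sensitivity"]].to_string(index=False))
```

Even on this subset the tension is visible: the unsampled complex tree tops the accuracy ranking but sits last on sensitivity, while the down-sampled (ds) models show the reverse pattern.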